CK-12 Foundation Anoka Hennepin Probability and Statistics Say Thanks to the Authors Click http://www.ckl2.org/saythanks (No sign in required) Engelhaupt Haney Hennepin Johnson To access a customizable version of this book, as well as other interactive content, visit www.ckl2.org CK-12 Foundation is a non-profit organization with a mission to reduce the cost of textbook mate- rials for the K-12 market both in the U.S. and worldwide. Using an open-content, web-based collaborative model termed the FlexBook®, CK-12 intends to pioneer the generation and distribution of high-quality educational content that will serve both as core text as well as provide an adaptive environment for learning, powered through the FlexBook Platform®. Copyright © 2011 CK-12 Foundation, www.ckl2.org The names "CK-12" and "CK12" and associated logos and the terms "FlexBook®," and "FlexBook Platform®," (collectively "CK-12 Marks") are trademarks and service marks of CK-12 Foundation and are protected by federal, state, and international laws. Any form of reproduction of this book in any format or medium, in whole or in sections must include the referral attribution link http://www.ckl2.org/saythanks (placed in a visible location) in addition to the following terms. Except as otherwise noted, all CK-12 Content (including CK-12 Curriculum Material) is made available to Users in accordance with the Creative Commons Attribution/Non-Commercial/Share Alike 3.0 Unported (CC-BY-NC-SA) License (http://creativecommons.0rg/licenses/by-nc-sa/3.O/), as amended and updated by Creative Commons from time to time (the "CC License"), which is incorporated herein by this reference. Complete terms can be found at http://www.ckl2.org/terms. Printed: July 2, 2012 /lexboo< next generation textbtwks Authors Mr. Michael Engelhaupt, Ms. Heather Haney, Anoka Hennepin, Mr. Ernie Johnson Contributors Mr. Bruce DeWitt, Ms. Anne Roehrich, Mr. Tom Skoglund Editors Ms. Katie Bruck, Ms. Wendy Durant, Mr. Charles Nowariak, Ms. Julie Rydberg, Ms. Meghann Witchger www.ckl2.org Contents Foreword iv Preface v 1 Counting Methods 1 1.1 Sample Spaces, Events, and Outcomes 1 1.2 Fundamental Counting Principle 5 1.3 Permutations 11 1.4 Combinations 15 1.5 Mixed Combinations and Permutations 19 1.6 Chapter 1 Review 23 2 Calculating Probabilities 26 2.1 Calculating Basic Probabilities 26 2.2 Compound and Independent Events 37 2.3 Mutually Exclusive Outcomes 45 2.4 Tree Diagrams and Probability Models 52 2.5 Conditional Probabilities and 2- Way Tables 59 2.6 Chapter 2 Review 67 3 Expected Values &: Simulation 75 3.1 Probability Models & Expected Value 75 3.2 Applied Expected Value Calculations 85 3.3 Simulation and Experimental Probability 92 3.4 Chapter 3 Review 99 4 Data Collection 102 4.1 DATA 102 4.2 Sample Survey and Census 112 4.3 Random Selection 128 www.ckl2.org 11 4.4 Statistical Conclusions 134 4.5 Experiments and Observational Studies 140 4.6 Chapter 4 Review 148 5 Analyzing Univariate Data 154 5.1 Categorical Data 154 5.2 Time Plots & Measures of Central Tendency 167 5.3 Numerical Data 5.4 Numerical Data 5.5 Numerical Data 5.6 Numerical Data Dot Plots & Stem Plots 178 Histograms 190 Box Plots & Outliers 198 Comparing Data Sets 214 5.7 Chapter 5 Review 225 6 Analyzing Bivariate Data 239 6.1 Displaying Bivariate Data 239 6.2 Correlation 253 6.3 Least-Squares Regression 264 6.4 More Least-Squares Regression 275 6.5 Chapter 6 Review 281 7 The Normal Distribution 285 7.1 Introduction to the Normal Curve 285 7.2 Z-Scores, Percentiles, and Normal CDF 295 7.3 Inverse Normal Calculations 303 7.4 Chapter 7 Review 309 8 Appendices 314 8.1 Appendix A - Tables 314 8.2 Appendix B - Glossary and Index 320 8.3 Appendix C - Calculator Help 333 111 www.ckl2.org Foreword Anoka-Hennepin Schools is fortunate to have many experienced math teachers contribute to this project. These primary authors worked tirelessly during the summer to complete the formidable task of completing a textbook in 60 days. Meet the Authors Michael Engelhaupt has taught high school mathematics for 9 years. He received a Master's of Math Education from the University of Minnesota. Currently, Michael teaches Statistics and AP Statistics at Blaine High School in Blaine, MN. Heather Haney has taught high school mathematics for 19 years and currently teaches at Coon Rapids High School in Coon Rapid, MN. Heather teaches Probability and Algebra. Ernie Johnson currently teach mathematics at Andover High School in Andover, MN. He has taught mathematics for 19 years teaching courses varying from Algebra, to Statistics, to AP Calculus. Ernie graduated from the University of Minnesota in 1992 with a B.S. in Mathematics and received an M.Ed, in Instructional Systems and Technology in 1998 from the University of Minnesota. www.ckl2.org IV Preface About the Book Anoka-Hennepin Schools is thrilled to release the publication of its very own Probability and Statistics textbook. Anoka-Hennepin Probability and Statistics (First Edition) represents the work of a large team of dedicated writers and editors who have produced a truly unique and flexible "ebook." Available in multiple electronic formats, the content demonstrates 21 st century math learning at is finest. Students can access the book from a CD-ROM, DVD, flash drive, or mobile device like the Kindle or ipod. Access is also available through the web anywhere and anytime in multiple formats. Technology While paper copies are available for classroom use, the ebook is interactive and includes web site links, simulations, and real world statistical examples. Students can access the textbook through the district Learning Management Site Moodle where large amounts of supplemental and enrichment content can also be found. The ebook incorporates the use of the TI 83/84 graphing calculators and students work with spread- sheet software to display and manipulate statistical data. Additional content is available through Kahn Academy, which offers individualized problem activities with instructional videos. Find the ebook @ Http://moodle.anoka.kl2.mn.us. Coverage This foundational course covers the Minnesota Data, Analysis, and Probability benchmarks. The course also meets Anoka-Hennepin math graduation requirements. Goals From the Twins to the weather forecast statistics are used everywhere in our lives. Anoka-Hennepin Probability and Statistics demonstrates the connection between statistics and our real world. Students- Read and immerse yourself in this interactive textbook. Challenge yourself to dig deeper into the content or find solutions to your questions online. This textbook is alive and responsive to your needs. Give feedback to your teacher for incorporation into later revisions. Your input is valued going forward. Enjoy. www.ckl2.org Chapter 1 Counting Methods 1.1 Sample Spaces, Events, and Outcomes Learning Objectives • Determine the sample space for a given event or series of events • Produce an organized list of outcomes within a sample space A sample space is a list of all the possible outcomes that may occur. What might happen when you flip a coin? You will either get heads or tails. What will happen when you roll a single die? You will either get a 1, 2, 3, 4, 5, or 6. The sample space for flipping a coin is S={heads, tails}. The sample space for rolling a die is S={1,2,3,4,5,6} On a coin flip, there are two outcomes, heads and tails. There are six different outcomes when considering the event of rolling a single die. Example 1 Suppose you roll two dice. Build a 6 by 6 grid to show the different outcomes that might happen when you add the two dice together. a) What is the sample space for the different sums that you might get? b) What is the event for this situation? c) Based on your grid, which outcome occurs most often? 1 www.ckl2.org Solution ♦ 1 2 3 4 5 6 1 2 3 4 5 6 7 2 3 4 5 6 7 3 3 4 5 6 7 3 9 4 5 6 7 3 9 10 5 6 7 3 9 10 11 6 7 3 9 10 11 12 a) The sample space is S={2, 3,4, 5, 6, 7,8,9, 10, 11, 12}. b) The event is the rolling of the two dice. c) Notice that a total of 7 can occur 6 different ways. A total of 7 is the most likely outcome. Example 2 A child orders breakfast at a restaurant. The restaurant has two choices of drinks: milk and orange juice. The restaurant also has three choices of meat: sausage, ham, and bacon. Suppose the child orders one drink and one type of meat. a) Give the sample space that shows all the different outcomes for what the child might order. b) How many different outcomes are possible? Solution a) For the drinks, use M=Milk and 0=Orange Juice. For the meat, use S=Sausage, H=Ham, and B=Bacon. The child might order MS, MH, MB, OS, OH, or OB. The sample space is S={MS, MH, MB, OS, OH, OB}. This list can also be generated using a simple grid as shown below. www.ckl2.org Milk Orange Juice Sausage MS OS Ham MH OH Bacon MB OB b) There are six possible outcomes. This can be found simply by counting the number of results within the sample space. Sometimes, situations can get a bit too complex to simply make a list or build a grid. A tree diagram is a visual organizer that is very effective in handling situations with larger numbers of outcomes. We will introduce this concept here, but we will revisit tree diagrams in greater detail in section 1.2. Example 3 A dart player is trying to hit the bulls-eye with each of three darts that he will throw. Each dart will either hit the bulls-eye or miss the bulls-eye. Use a tree diagram to give the sample space for the different outcomes that may occur. Solution Build the tree diagram shown below to track what might happen. H,H,H H,H,M m^**- H,M,H Miss H,M,M JHit . M,H,H Miss M,H,M Hit___ M,M,H M,M,M The sample space is S={HHH, HHM, HMH, HMM, MHH, MHM, MMH, MMM} 3 www.ckl2.ors Problem Set 1.1 1) A single coin is flipped two times. a) Construct the sample space for this situation. b) How many different outcomes are possible? 2) A single coin is flipped three times. a) Use a tree diagram to construct the sample space for this situation. b) How many different outcomes are possible? 3) A single coin is flipped four times. a) Construct the sample space for this situation. b) How many different outcomes are possible? 4) Suppose a 4-sided die is rolled one time. What is the sample space for the result of the roll? 5) Suppose two 4-sided dice are rolled and we keep track of the total on the two dice. a) Draw a four by four grid that demonstrates the different results for the total of the two dice. b) What is the sample space for the possible totals of the two dice? H 3 \\ ■i-- 6) Suppose a 4-sided die is rolled two times and we keep track of the product when the result from the first die is multiplied by the result from the second die. a) Draw a four by four grid that demonstrates the different results for the product of the two dice. b) What is the sample space for the possible products of the two dice? c) How many different outcomes are possible for the product of the two dice? d) What outcome occurs most often? www.ckl2.org 4 1.2 Fundamental Counting Principle uf m * - - Learning Objectives • Apply the Fundamental Counting Principle to determine the number of outcomes • Create tree diagrams to represent outcomes for a series of events The Fundamental Counting Principle states that if you wish to find the number of outcomes for a given situation, simply multiply the number of outcomes for each individual event. In Example 2 in section 1.1, the child had two different choices of drink and three different choices of meat. If we multiply 2 times 3, we get 6 which is the total number of outcomes possible. The Fundamental Counting Principle expands to any number of events. For example, suppose it turned out that the child also wanted to order eggs and had a choice between scrambled and sunny-side up. The fundamental counting principle states that there are 2x3x2 or 12 ways to order this breakfast. There are other ways to visually see what is happening here. Let's use a tree diagram. Example 1 Build a tree diagram that shows the different outcomes for what the child might order for breakfast. Solution The first set of branches of the tree diagram will represent the type of drink, the second set of branches will represent the type of meat, and the third set of branches will represent the type of egg. www.ckl2.org Drink fv'eat Scra^^- Milk, Bacon, Scrambled Eggs SS-Up Juice, Ham, Sunnyside Up We have labeled the ends of two of the branches in the figure on the previous page to show what each branch means. For example, one of the labeled branches shows that the child might have ordered milk, bacon, and scrambled eggs. The Fundamental Counting Principle is critically important especially when considering complex tree diagrams. Our tree diagram above has many branches and it tracks a great deal of material. It ultimately shows us the 12 different possible breakfast orders, but it takes a large amount of organization to successfully complete. Multiplying 2 by 3 by 2 is a much quicker way to find out the total number of possible outcomes. Example 2 A couple is planning to have 3 children. Consider the different results that might occur in terms of gender. For example one outcome might be Boy, Boy, Girl (BBG). a) Using the Fundamental Counting Principle, calculate the number of different outcomes for the children in this family. b) Build a tree diagram that shows the different orders of children the couple might have. c) Construct the sample space that shows all the different orders of children the couple might have. www.ckl2.org 6 Solution a) There are 2 choices for the first child, 2 for the second, and 2 for the third. Therefore, there are 2x2x2 = 8 outcomes for the gender order of the 3 children. b) c) In order to be organized, the list will be alphabetized. BBB, BBG, BGB, BGG, GBB, GBG, GGB, GGG There are a total of 8 outcomes. There are many other ways to apply the Fundamental Counting Principle. A standard deck of cards has 52 cards as shown below. If you are dealt just one card, there are 52 different outcomes. Standard Deck of $2 Playing Cards Clubs Spades Hcara Diatnontis A* A* AT A* 2* Z4k 2* 24) 3+ 3« i* i* 4+ 1* v$ «-* 5* 5* 5* st 6* 6* 6* 6* 7+ 7* 7* 7# 3* 94k •V ft 9* 94k y* It 10+ 10* UHP 10 + Jack* Jack* JackV Jack* Queen* Queen * Queen* Queen* King* King* King* King-* www.ckl2.org Example 3 Suppose you are dealt two cards from a standard deck of 52 cards. How many different outcomes are possible? Solution We could certainly try drawing a tree diagram but that could get very large quite quickly. The first split alone would have 52 branches on it. On the other hand, if we use the Fundamental Counting Principle, we can simply calculate how many different ways we could be dealt 2 cards from a standard deck. There would be 52 choices for the 1st card and 51 choices for the 2nd card. (Once the first card is dealt, the deck only has 51 cards left in it.) There are 52x51 = 2652 ways that we could be dealt two cards from the deck. Example 4 How many different 7-digit phone numbers are possible if no phone number may begin with a zero? Solution There are a total of 10 digits available {0, 1, 2, ... 7, 8, 9}. We can't use zero for the first digit so there are only 9 choices for the 1st digit. After that, there are 10 digits available for each of the remaining six digits. This gives us 9 x 10 x 10 x 10 x 10 x 10 x 10 = 9 x 10 6 = 9, 000, 000 ways to come up with a 7 digit phone number. Example 5 A teenager is given 5 different jobs that they must do before they may go out to a movie with friends. The jobs are washing the car, starting a load of laundry, vacuuming the family room, taking out the garbage, and putting away the dishes. In how many different orders could the teenager complete these jobs? Solution There are five choices the teenager could pick for the first job. Once that job is finished, there are only 4 jobs remaining. Once the 2nd job is completed, there are only 3 choices for the 3rd job. Once the 3rd job is finished, there are only 2 choices for the 4th job and finally there will only be one choice left for the 5th job. There are 5x4x3x2x1 = 120 different orders that these jobs could be completed. Note that there is a quick way to do this ordered multiplication using factorials. 5x4x3x2x1 = 5! The 5! is read "Five Factorial". Be sure to locate the factorial key on your calculator. www.ckl2.org 8 Problem Set 1.2 Exercises 1) A woman has three skirts, five shirts, and four hats. How many different outfits can she wear if she picks one skirt, one shirt, and one hat for her outfit? 2) How many different five-digit ZIP codes are possible if the digits can be repeated? 3) How many different five-digit ZIP codes are possible if the digits cannot be repeated? 4) In how many ways can a baseball manager arrange a batting order of nine players? 5) A store manager wishes to display six different kinds of laundry soap by lining them up in a row on a shelf. In how many ways can this be done? 6) There are eight different statistics books, six different geometry books, and three different trigonometry books. In how many ways can a textbook committee select one of each book for their school? 7) At a film festival, there are eight different films that will be shown. In how many different orders can these films be shown? 8) The call letters of a radio station must have four letters. The first letter must be a K or a W. How many different call letter combinations are possible if letters may not be repeated? 9) The call letters of a radio station must have four letters. The first letter must be a K or a W. How many different call letter combinations are possible if letters may be repeated? 10) How many different four-digit ID tags can be made if repeats are allowed? 11) How many different four-digit ID tags can be made if it must start with a 7 and no repeats are allowed? 12) In how many different ways can the Harry Potter series of books (7 books total) be arranged in a row on a shelf? 13) In how many different ways can a manager select a pitcher - catcher combination if the manager has 5 pitchers and 2 catchers to choose from? 14) A coin is tossed 8 times. How many different outcomes are there for this series of 8 flips? 15) Six different colored tiles are available to make a pattern in a row of floor tile. How many possible different 4-color patterns are possible if no colors may be repeated? 16) Six different colors of tile are available to make a pattern in a row of floor tile. Many tiles of each color are available. How many 4-colored patterns can be made if colors may be repeated? 17) Four cards are dealt from a standard deck of 52 cards. In how many different orders of suit could the 9 www.ckl2.org cards be dealt? For example, one order is Club, Heart, Club, Diamond. 18) A pizza restaurant offers 6 different toppings for their pizzas. How many different pizzas are possible? 19) Use a tree diagram to find all possible outcomes for the result of a series of coin flips if the coin is flipped two times. Write a list of the possible results when complete. 20) The Super-Cool Ice Cream Shoppe sells sundaes, cones, or ice cream bars. You will pick either butterscotch or chocolate and you may choose to have it with nuts or without nuts. a) Draw a tree diagram to illustrate the different types of ice cream treats that you could order. b) How could you find the number of outcomes using the Fundamental Counting Principle? c) How many different outcomes are possible? 21) A quiz has four true/false questions on it. Use a tree diagram to show all the different possible answer keys. 22) A box contains a $1 bill, a $5 bill, and a $10 bill. Two bills are selected one after the other without replacing the first bill. Draw a tree diagram to show all possible amounts of money that may be drawn. 23) The Eagles and Hawks play each other in a hockey tournament. The first team to win two games is the champion. Use a tree diagram to show all different possible outcomes for the tournament. Review Exercises 24) Consider a situation in which a baseball manager must decide which one of four players will pitch (PI, P2, P3, or P4) and which one of 2 players will catch (CI or C2). a) What is the sample space for this situation? b) How many outcomes are possible? www.ckl2.org 10 1.3 Permutations ***** C3 Learning Objectives • Know the definition of a permutation • Be abie to caicuiate the number of permutations using the permutations formula and with technology • Understand the connection between the Fundamental Counting Principle and permutations The Fundamental Counting Principle provides us with a tool that allows us to calculate the number of outcomes possible in many situations. What if the situation is a bit more complex? For many situations, the order that we complete a task does not matter. Ordering milk, bacon, and scrambled eggs in that order is the same as ordering bacon, scrambled eggs, and milk. In this case the order that we make our choices wouldn't matter, but there are many situations in which the order that we do things does make a difference. A permutation is a specific order or arrangement of a set of objects or items. What if you wish to call someone on the phone? If I make the call, the order that I punch in the numbers matters so this is an example of a permutation. A good question to ask when deciding if your arrangement is a permutation is "DOES ORDER MATTER?" If yes, then you are dealing with a permutation. For example, if you ordered an ice cream sundae and they put the cherry in first, then the chocolate sauce, and then the ice cream, you would probably would not be happy with that particular ice cream sundae. You would likely prefer that they put the ice cream in first, then the chocolate sauce, and then put the cherry on top. Clearly each sundae had the same three ingredients, but they were quite different from one another. Each order that we can make the ice cream sundae is called a permutation. There is a simple formula for figuring out how many permutations exist when V objects are selected from a set of 'n' objects. The left side of the equation can be read "n P r", just as it looks or "n Permutations of size r". (n—r)\ 11 www.ckl2.org Recall that the exclamation point is a factorial. For example, 5! = 5x4x3x2x1. Also, be sure to find the permutations command on your calculator. In our ice cream sundae discussion, 'n' would be 3 because there are 3 items to select from and V would also be 3 because we are going to select all three items. Using the permutations formula, this would be 3P3 = tjt^tt = §i — f — 6. In other words, there are 6 different orders that the ice cream sundae could be made. Note that 0! is equal to 1. Example 1 Suppose you are going to order an ice cream cone with two different flavored scoops. You are going to take a picture of your ice cream cone for use in the school newspaper. The ice cream shop has 5 flavors to choose from; chocolate, vanilla, orange, strawberry, and mint. How many different ice cream cone photos are possible? Solution The first question to ask is "Does Order Matter?". If it does, then we are dealing with a permutation question. In this case, the order does make a difference. A chocolate on top of vanilla cone looks different than a vanilla on top of chocolate cone. We have five flavors to pick from, so n=5. We are going to select 2 flavors so r=2. 5P2 = >Am = ~H = e = 20 There (5-2)! 20 different permutations of ice cream cones we could order. The notation representing this situation, 5P2, can be read as "Five 'P' Two" or "Five permutations of size Two". Be sure to perform this calculation using your calculator as well. In the example above, you could have also found your answer using the Fundamental Counting Principle. There were 5 choices for the 1st flavor and then only 4 choices for the 2nd flavor. There are 5 x 4 = 20 ice cream cones possible. Example 2 Give the value of qP^ by using the formula for permutations. Verify your solution on your calculator. www.ckl2.org 12 Solution 6! __ 6! _ 720 -3)! 3! 6^3 = At = £ = I#i = 120 Example 3 Decide whether each of the situations below involves permutations. a) A five-card poker hand is dealt from a deck of cards. b) A cashier must give 3 pennies, 2 dimes, a 5 dollar bill, and a 10 dollar bill back as change for a purchase. c) A student is going to open a padlock that has a three number combination. d) A child has red, blue, green, yellow, and orange color crayons and will be coloring a rainbow using each color one time. Solution a) The order you get your five cards for a poker hand does not matter. If one of your cards was the ace of spades, it didn't matter if it was the first card or the last card dealt. b) The order that the cashier gives you $15.23 in change does not matter as long as the total is $15.23. c) The order you put in the three numbers for the combination makes a difference. If the correct combination is 12-27-19, the padlock will not open if you enter 19-12-27 even though the same three numbers are used. d) The order that the child colors the rainbow does make a difference. The color pattern red, blue, green, orange, yellow will look different than green, blue, red, yellow, orange. Problem Set 1.3 Exercises ■i! 1) Use the formula for Permutations, n P r = ? "' n to find the value for each expression. Confirm each result by using your calculator. a) 8^3 b) 4 />4 c) 5^3 d) 5 P 13 www.ckl2.org 2) How many 4 letter permutations can be formed from the letters in word rhombus? 3) For a board of directors composed of eight people, in how many ways can a president, vice president, and treasurer be selected? 4) How many different ID cards can be made if there are six digits on a card and no digit can be used more than once? 5) In how many ways can seven different types of laundry soap be displayed on a shelf in a store? 6) A child has four different stickers that can be placed on a model car in a vertical stack. In how many ways can this be done if each sticker is to be used only one time? 7) An inspector must select three tests to perform in a certain order on a manufactured part. He has a choice of seven tests. How many different ways can he perform three tests? 8) In how many different ways can 4 raffle tickets be selected from 50 tickets if each ticket wins a different prize? ON 032 064032 D A WI V KEEP ™S COUPON TTryrT ItflF f Lib M IVniJ M REDEEM FUR PRIZES 9) A researcher has 5 different antibiotics to test on 5 different rats. Each rat will receive exactly one antibiotic and no rat will receive the same antibiotic as any other rat. In how many different ways can the researcher administer the antibiotics? 10) There are five violinists in an orchestra. Three of them will be selected to play in a trio with a different part for each musician. In how many ways can the trio be selected? 11) There are five violinists in an orchestra. Four of them will be selected to play in a quartet with a different part for each musician. In how many ways can the quartet be selected? 12) There are five violinists in an orchestra. All five of them will be selected to play in a quintet with a different part for each musician. In how many ways can the quintet be selected? 13) There are five violinists in an orchestra. A piece of music is written so that it can be played with either 3, 4, or 5 violinists. Each musician selected to play this piece will play a different part. In how many ways can a group of at least three musicians be selected? Hint: Use your answers from problems 10), 11) and 12). 14) Decide whether each situation below involves permutations. Briefly explain your answers. a) Sophia picks three color crayons from a box of 12 crayons to make a picture for her cat Butterscotch. b) A five-digit code is needed to open up an electronic lock on a car. c) Twenty race car drivers must each complete three laps at a race track during a time trial, one after another, in order to establish the order in which the cars will start a race the next day. www.ckl2.org 14 d) There are seven steps that a student must follow when preparing cookies during their Family and Consumer Sciences course. Review Exercises 15) Use the Fundamental Counting Principle to determine the number of different ways a person could order a meal if they are to pick one entree from four choices, one side order from three choices, and one drink from four choices. 16) A student wishes to check out three books from the library. She will check out one historical fiction book, one biography, and one book on art history. Build a tree diagram to show how many ways can this be done if there are two historical fiction books, three biographies, and two books on art history that she is considering checking out. 17) How many different outcomes are possible for the total on a roll of two dice if one die has 6 sides and one die has 4 sides? 1.4 Combinations ***** Learning Objectives Know the definition of a combination Be able to calculate the number of combinations using the combinations formula and with technology We just looked at situations in which order matters. What if order does not matter? Suppose you have a younger brother or sister and your family goes out to a restaurant. There is a children's menu with activities at the restaurant that all the kids get. The owner of the restaurant has decided that each child will receive two different colored crayons to use on their menu. The restaurant happens to carry five colors of crayons: orange, yellow, blue, green, and red. This is a situation in which the order that the child gets their two color crayons does not matter. If you gave a child a red crayon and then a blue crayon, it would be the same as if you gave the child a blue crayon followed by a red crayon. As with permutations, the first question to ask is "Does Order Matter?". When the order does not matter, you are dealing with a situation that involves combinations. Example 1 Consider the color crayon problem in the previous situation. Make a list showing all of the different color crayon combinations that might occur. Be organized so as not to repeat any combinations. 15 www.ckl2.org Solution To be organized, use the letters O, Y, B, G, and R to represent the five colors (Orange, Yellow, Blue, Green, and Red). Alphabetizing the list to insure that we don't skip any combinations gives us BG, BO, BR, BY, GO, GR, GY, OR, OY, RY Notice that while we have BG, we don't have GB as that would be a repeat. It appears that there are 10 combinations possible. As with the Fundamental Counting Principle, we now must ask the question "How can we find the solution quickly?" Making a list works nice, but it could get a bit messy if the restaurant had 24 colors to choose from instead of 5 because our list would get very long. Out of curiosity, you may have tried 5/V However, when you work this out, you find that this gives us a result of 20 instead of 10. We must modify this formula for situations involving combinations. Shown below is the formula for finding how many combinations are possible when order does not matter. c = ft r n I r\(n -r)\ As with the permutation formula, the 'n' stands for the number of objects available and the V stands for the number of objects that will be selected. www.ckl2.org 16 Example 2 Consider the color crayon problem once again. Use the formula to find out the number of different color crayon combinations that are possible. Solution In our problem, 'n' is equal to 5 and V is equal to 2. Our calculation would be 5C2 = 2U5-2V = 2^| = w = ~0T = 10- Be very careful that you find the result for the denominator before you divide! Find the n C r command on your calculator and verify that 5C2 is indeed equal to 10. Example 3 Suppose that there are 12 employees in an office. The boss needs to select 4 of the employees to go on a business trip to California. In how many ways can she do this? Solution We first ask whether the order that the employees are selected matters. In this case, the answer is no because either you will be going on the trip or you won't be going. Being the fourth name on the list of people who get to go is just as good as being the first name on the list. We have 12 people to select from and we will be selecting 4 or 12C4 = imfi^n = 495. There are 495 possible combination of groups of 4 that might be selected to go on the trip to California. Problem Set 1.4 Exercises 1) Use the formula for combinations to find the value of each expression. Use a calculator to verify each answer. a) 5C5 b) eQ c) 3C0 d) 7 C 3 2) In how many ways can 3 cards be selected from a standard deck of 52 cards? 3) In how many ways can three bracelets be selected from a box of ten bracelets? 4) In how many ways can a student select five questions to answer from an exam containing nine questions? IT www.ckl2.org 5) In how many ways can a student select five questions to answer from an exam containing nine questions if the student is required to answer the first and the last question? 6) The general manager of a fast- food restaurant chain must select 6 restaurants from 11 for a promotional program. In how many different possible ways can this selection be done? 7) There are 7 women and 5 men in a department. In how many ways can a committee of 4 people be selected? 8) For a fundraiser, a travel agency has donated 5 free vacations to Mexico as grand prizes in a raffle. Suppose that 220 people paid for raffle tickets. In how many different ways can the vacation winners be selected? 9) A high school choir has 27 female and 19 male members. Two students will be selected from the choir to represent the school in the All-State Choir. a) In how many ways can the director select two students if she decides both students will be female? b) In how many ways can the director select two students if she decides both students will be male? c) In how many ways can the director select two students? d) Using your answers from a), b) and c), determine how many ways the choir director can select two students such that one student will be a male and one student will be a female? Review Exercises 10) In how many ways can the team captain of a kickball team arrange the kicking order for the 7 players on the team? 11) An electronic car door lock has five buttons on it and each button has a different digit. Suppose the combination to unlock the door is 4 digits long. a) How many different combinations are possible if digits can be repeated? b) How many different combinations are possible if digits cannot be repeated? 12) Give the sample space for the different results that may occur if a coin is flipped twice. www.ckl2.org 18 13) Decide whether each situation involves permutations. a) A teacher must pick two students from a class of 30 to put their answers on the board for problem #11 from last night's homework. b) In order to be allowed outside to play in the rain, a 5- year old must put on socks, shoes, and boots. c) A student has a strict bedtime of 11 pm. They need two hours to finish writing a paper, one hour for a math assignment, two hours for a science experiment, two hours to practice piano, and 1 hour for relaxation. It is 6 pm right now. 1.5 Mixed Combinations and Permutations C3 Learning Objectives • Determine whether a situation involves permutations or combinations • Understand the mathematical implications of the words 'and' & 'or' Having covered the basics of combinations and permutations, you are ready to have a mixture of prob- lems with slight variations. A common variation involves an understanding of some key words used in 19 www.ckl2.org mathematics. Commonly, the word "and" indicates multiplication and the word Consider the examples below. Example 1 In how many ways can committee of 3 people be chosen if there are 8 men and 4 women available for selection and we require that two men and one woman be on the committee? Solution The order that we place the people on a committee does not matter. It makes no difference if you are the first person or the last person selected for the committee. Either you are on the committee or you are not on the committee, therefore this is a combination question. Notice that we want two men and one women. The word 'and' indicates multiplication. In other words, we will look for the product of how many ways we can select two men from eight and one women from four. 8C2 X4 C\ = 28 x 4 = 112. There are 112 ways to select this committee of 3 people. Example 2 In how many ways can a committee of 5 people be chosen if there are 7 men and 5 women available for selection and we require at least 4 women on the committee? Solution We first ask "Does order matter?". In this case, the order that someone is placed on a committee does not matter. Either you are on the committee or you are not. Once again, we are dealing with a combination question. The key phrase in this example is at least. This can be interpreted to mean that we either select 4 women and 1 man or 5 women and men. Remember that the word 'and' indicates multiplication and the word 'or' indicates addition. It looks like we are going to have some addition and multiplication in this problem. 5C4 X7 C\ +5 C5 X7 Co = 5 x 7 + 1 x 1 = 35 + 1 = 36. There are 36 ways to put this committee together. Example 3 In a certain country, there are two political parties. Each party is responsible for nominating both a presidential and vice-presidential candidate. The candidates will participate in a debate once they are chosen. In the first party, there are 6 candidates available and in the second party there are 5 candidates available. How many different debate combinations are possible? Solution The order that we select the candidates does make a difference. Selecting party member 'A' for a presidential candidate and party member 'B' for a vice-presidential candidate is different www.ckl2.org 20 than selecting party member 'B' for a presidential candidate and party member 'A' for a vice-presidential candidate. Therefore, this is a permutations question. Since we will select candidates from the first party and candidates from the second party, we expect there to be multiplication in this problem as well. 6-^2 X5 P2 — 30 x 20 = 600. There are 600 different ways that the debate participants can be chosen. Problem Set 1.5 Exercises 1) In your own words, state how you can tell the difference between a combination and permutation problem. 2) Your closet contains 10 different styles of shoes. In how many ways can you pick out five different styles of shoes for the school week if you don't care which day of the week you wear each style? 3) Your closet contains 10 different styles of shoes. In how many ways can you pick out five different styles of shoes for the school week if you do care which day of the week you wear each style? 4) You are drawing a rainbow using five different colored crayons from your box of 24 colors. In how many ways can you draw a rainbow if the first color you pick will be the top layer and so on? 5) You are drawing a rainbow using five different colored crayons from your box of 24 colors. In how many ways can you pick the five colors for your rainbow? 6) Suppose 5 cards are dealt from a standard deck of 52 cards. a) How many unique 5-card hands are possible? b) In how many different orders can those 5 cards be dealt? 7) Suppose the majority party in a foreign country must select a prime minister and secretary of state from an eligible group of 36 party members. In how many ways can this be done? 8) There are 7 women and 5 men in a department. Four people are needed for a committee. a) In how many ways can a committee of 4 people be selected? 21 www.ckl2.org b) In how many ways can this committee be selected if there must be exactly 2 men and 2 women on the committee? c) In how many ways can this committee be selected if there must be at least 2 women on the committee? 9) A company has 8 cars and 11 trucks. The state inspector will select 3 cars and 4 trucks to be tested for safety inspections. In how many ways can this be done? 10) In a train yard there are 4 tanker cars, 12 boxcars, and 7 flatcars available for a train. In how many ways can a train be made up consisting of 2 tanker cars, 5 boxcars, and 3 flatcars? 11) Flakes-R-Us cereal comes in two types, Sugar Sweet and Touch O' Honey. If a researcher has ten boxes of each type, how many ways can she select two boxes of each for a quality control test? 12) In how many ways can a jury of 12 people be selected from a pool of 12 men and 10 women? 13) In how many ways can a jury of 6 men and 6 women be selected from a pool of 12 men and 10 women? 14) A corporation president must select a manager and assistant manager for each of two stores. In how many ways can this be done if the first store has 9 employees and the second store has 7 employees? (Employees will stay at their current stores.) 15) Suppose that in this trimester that every sophomore is required to take 2 math classes, 2 social studies classes, and a reading class. How many different combinations of teachers are possible for a given student if there are 9 math teachers, 12 social studies teachers, and 4 reading teachers available? (No student will have the same teacher for two different hours.) 16) In how many different ways can six people be assigned to three offices if there will be two people in each office? Review Exercises 17) Use the formula for combinations to find the value of 7C3. 18) Use the formula for permutations to find the value of qP^. 19) In how many ways can the letters in the word 'magic' be arranged? 20) How many different outcomes are possible for the total of when two 4-sided dice are rolled? 21) A teacher will select three students to work problems on the board from her class of 34 students. In how many ways can this be done if the three problems to be worked are #11, #14, and #26? www.ckl2.org 22 1.6 Chapter 1 Review ~^^^^fc There are three primary counting methods that are commonly used in probability: the Fundamental Counting Principle, combinations, and permutations. The Fundamental Counting Principle states that to find the number of outcomes for a given situation, simply multiply the number of ways each event may occur by each other. When deciding whether to use combinations or permutations, you must ask if the order matters. If so, use permutations, otherwise use combinations. When working with counting, make sure that you have an organizational strategy. For example, use an alphabetized list, a sketch, or a tree diagram. This will make it much easier for you to come up with the sample space. Chapter 1 Review Exercises 1) Suppose that two 5-sided dice are rolled. a) Draw a grid showing all the outcomes for the different totals that may occur. b) Use {brackets} to write down the sample space. c) Suppose a friend offers to play a game in which you are paid $4 any time a number divisible by 4 occurs. Otherwise you pay your friend $2. If you decide to play, would you expect to win money or lose money? Use your grid from part a) to help explain your answer. 2) The lunch at The Diner has a choice of ham, turkey, or roast beef on rye or white bread with coffee or milk. Draw a tree diagram that illustrates what a person might have for lunch if they pick only one meat, one bread, and one drink. 3) Find the value for each expression below. Show your work by hand and use your calculator to verify your results. a) 5! c) 7C5 d) (5 - 2)! e) 4! - 2! 4) There are four runners in a race. In how many ways can the runners finish the race? 5) A store has eighteen outfits available for display, but only six outfits can be used for a window display. If each order counts as a new arrangement, how many different arrangements are possible? 6) Paul has three baseballs and four bats. How many possible ball and bat combinations can he choose? 23 www.ckl2.org 7) How many license plates are possible if each plate must have three letters followed by three digits (repeats are allowed)? LXPLOBL Minnesota * DCE'353 CS3 • 10,000 lafees • PPI 8) How many license plates are possible if each must have three letters followed by three digits (repeats are not allowed)? 9) There are twenty candidates in the Mr. Minnesota contest. How many ways could the judges choose the winner, first- runner up, and second-runner up? 10) The yearbook editor must select two photos out of 42 juniors and two out of the 45 seniors for a page in the yearbook. How many photo combinations are possible? 11) A homeless shelter has decided to purchase all new kitchen appliances. They need one oven, one refrigerator, and one dishwasher. The appliance store has 7 brands of ovens, 6 brands of refrigerators, and 5 brands of dishwashers. In how many brand arrangements can they purchase their appliances? 12) An ice cream shop has 8 different flavors of ice cream available. How many 2-scoop cones can be made if you are allowed to have the same flavor for both scoops? 13) An ice cream shop has 8 different flavors of ice cream available. How many 2-scoop cones can be made if you decide not to have the same flavor for both scoops? 14) Suppose a jury of 12 is being selected from a pool of 20 candidates. In how many ways can this be done? 15) Suppose a jury of 12 is being selected from a pool of 13 men and 7 women. In how many ways can this be done if the judge states that the jury must contain exactly 5 women? 16) Suppose a jury of 12 is being selected from a pool of 13 men and 7 women. In how many ways can this be done if the judge states that the jury must contain at least 5 women? 17) In how many ways can I put together an outfit if I have 7 shirts, 5 pairs of pants, and 4 hats from which to choose? 18) For $7.99, a restaurant will sell you their lunch special. The special is either a hamburger or chicken sandwich, onion rings or fries, and soda or coffee. a) Make a tree diagram showing the different ways a customer may order the lunch special. b) How many outcomes are there? Use the Fundamental Counting Principle to justify your answer. Image References Breakfast http://worldaffairspittsburgh.blogspot.com www.ckl2.org 24 Tetrahedral Dice http://www.bbc.co.uk Harry Potter Books http://www.dipity.com Ice Cream Cones http://www.bunrab.com Raffle Ticket http://canuckamusements.com Color Crayon http://www.rosespet.com Exam http://www.iphlebotomycertification.com Coin http://www.marshu.com Rainbow http://tracynicholls.webs.com/ Jury http://adriandayton.com License Plate http://www.15q.net 25 www.ckl2.org Chapter 2 Calculating Probabilities 2.1 Calculating Basic Probabilities Learning Objectives Understand how to calculate and write a probability Understand what constitutes chance behavior Understand the concept of the Law of Large Numbers Probabilities give us an idea of how likely it is for a certain event to happen. For example, when a coin is flipped, the chance that it comes up heads is 50%. Probabilities can be expressed as decimals, fractions, percents, or ratios. We could have said the probability of flipping heads is , 0.5, g, 50% or 1:2. Each of these conveys the idea that we should expect to get a heads half of the time. Probabilities only give us an idea of what to expect in the long run. However, they do not tell us what will happen in the short term. www.ckl2.org 26 Suppose we flip a coin 10 times in a row and get heads every single time. The next coin flip is still a random event because while we cannot tell for certain what the next flip will be, we can be certain that about 50% of all tosses over a long set of tosses will be heads. Some people think that we are on a roll so we are more likely to get another heads. Others will say that getting tails is more likely because we are due to get tails. The truth is that we cannot tell what will happen on the next flip. The only thing we know for certain is that there is a 50% chance that the coin will be heads on its next flip. If we continue to flip this same coin hundreds of times, we would expect the percent of heads to get closer and closer to 50%. Chance Behavior is not predictable in the short term, however, it has long term predictability. The Law of Large Numbers tells us that despite the results on a small number of flips, we will eventually get closer to the theoretical probability. The theoretical probability should match what would happen in the long run for some random event. The outcomes in any random event will always get close to the theoretical probability if the event is repeated a large number of times. We might roll a dice 4 times in a row and get a 6 each time, however, if we rolled this dice hundreds of times, the percent of time that we get a 6 gets closer and closer to the theoretical probability of g. When calculating a probability, we divide the number of favorable outcomes (outcomes we are interested in) by the total number of outcomes. In other words, the probability that outcome 'A' occurs is found by the formula Pf/O - # °f favorable outcomes tne iormuiar(/ij - total # of outcomes • ***** fc3 Consider a standard deck of 52 playing cards. Standard Deck of 52 Playing Cards Clubs Spades Hearts Diamonds A* A* AY A* Z* 2* HP 2* 3* 34 99 i* 4* <1* MP <]* 5* 5* 99 5 + 6* 6* 6* 6 + 7* 7* 79 7 + B* M 99 R + 9* 1* MP M 10* 10* 10T 10* Jack* Jack* Jack¥ Jack* Queen* Queen* Queen* Queen* King* King* King* King* If we asked the question "What is the probability of being dealt a face card (jack, queen, or king)?", we would need to count how many cards are face cards and then divide by the total number of cards 52. 27 www.ckl2.org In this situation there are 12 face cards and 52 cards overall so our probability of getting a face card is 12 _ _3_ noo 52 — 13 U - Z " i - In probability, there are outcomes that are sure to happen and there are outcomes that are impossible. If we are once again dealing with a standard 52 card deck, the chance of being dealt either a red card or a black card if one card is dealt is 100%. The chance of being dealt a blue card is 0% since there are no blue cards in a standard deck. All random events have probabilities between and 1. In addition, the sum total of the probabilities for all possible outcomes in the sample space is equal to 1. In other words, if an event occurs, there is a 100% chance that one of the possible outcomes will happen. The list below summarizes these rules. a) The probability of a sure thing is 1. b) The probability of an impossible outcome is 0. c) The sum of the probabilities of all possible outcomes is 1. d) The probability for any random event must be somewhere from to 1. As shown earlier, we notate the probability of event 'A' happening as P(A). For example, the probability of rolling a three on a six-sided die can be written P(3) = g. Sometimes we are interested in the probability of an event not occurring. This is called the complement of the event. We can write the probability of the complement of event 'A' happening as P(~A), P(not A), or P(A C ). The formula for the complement of an event is Pinot A) = 1 - P(A). On our die rolling question, P(~3)= 1 - P(3) = 1 - I = I. In other words, there is a I chance of the dice not landing on a 3. It is important to notice that the probability of an event happening and the probability of its complement always add up to 1. □ Example 1 Which of the following situations are random events? i) A student looks through their closet to decide what shirt to wear to school. ii) A student labels each of their 6 pairs of shoes 1 through 6 and then rolls a single die to decide which pair to wear. iii) The state legislature decides to increase funding to schools by 3%. iv) A professional golfer makes a hole-in-one on a 200 yard hole. www.ckl2.org 28 Solution Situations i) and ii) are not random events. In both cases, there are additional factors that are influencing the decision. The day of the week or the temperature outside might influence your shirt choice and how much money the state legislature happens to have might influence funding. Both situations iii) and iv) are random events because while we can't predict what will happen in this particular instance, we can make long term predictions. We can predict the percent of the time the student might end up with the shoes labeled #2 and we can predict the percent of the time that the golfer will make a hole-in-one based upon previous performance. Example 2 In the game of pool, there are a total of 15 balls. Balls numbered 1-8 are solid and balls 9-15 are striped. There is only one black ball (the eight ball). There are two of every other color of pool ball. Suppose the pool balls were put in a bag and a single pool ball is pulled out of the bag. What is the probability that the ball: a) is yellow? b) is striped? c) has a number on it that is greater than 10? d) is not striped? Solution a) P(Y) = & * 0.13 b) P(Striped) = ^ « 0.47 c) P(> 10) = ^ = I « 0.33 29 www.ckl2.org d) P(~Striped)=l - P{Striped) = 1 - X = A » 0.53 In addition to these types of questions, we can also calculate probabilities by incorporating our counting methods from Chapter 1. Recall that the probability of an event occurring is the number of favorable outcomes divided by the number of total possible outcomes. Example 3 A jury of 12 people is to be selected from a group of 12 men and 8 women. What is the probability that the jury has at least 6 women on it? Solution The total number of outcomes possible is based upon selecting 12 members from a pool of 20. Since order will not matter, there are 20C12 = 125,970 ways to pick a jury of 12. We now want to have at least 6 women on the jury. This means we could have 6 women and 6 men or 7 women and 5 men or 8 women and 4 men on the jury. Mathematically, this would be 8 C 6 Xi 2 C 6 + 8 C 7 Xi2C5+ 8 C 8 Xi 2 C4 = 28x924 + 8x792 + 1x495 = 25,872 + 6,336 + 495 = 32,703. There are 32,703 ways to have at least 6 women on the jury out of a possible total of 125,970 different juries or 12 g 970 « 0.26. There is about a 26% chance that the jury will have at least 6 women on it. Sometimes, data is organized in a Venn Diagram, as shown in the example below. We will examine these in greater depth in section 2.3 but for now, it is important to understand that a Venn Diagram is an organizational tool that makes it easier to interpret a situation and answer basic probability questions. Example 4 A class of 30 students is surveyed to see whether or not they had a science class and/or a math class this trimester. There are 18 students that have a math class, 14 students who have a science class, and 4 students who have neither. It also turns out that this includes 6 students who currently have both classes. The results of the survey are shown in the Venn Diagram below. www.ckl2.org 30 ^^^^~ - — ^^^ /Math \ Science \ 12 6 8 — "^"^ 4 a) How many total students are taking a math class this trimester? b) What is the probability that a randomly selected student is taking a math class this trimester? c) What is the probability that a randomly selected student is taking both a math and science class this trimester? d) What is the probability that a randomly selected student is not taking either a math or science class this trimester? Solution a) There are 12 kids who only have a math class and 6 kids who have both a math a science class this trimester for a total of 18 kids. 18 — 3 30 5 b) P(Math) c) P{Math & Science) 0.6 JL — I 30 5 0.2 d) P(NoMathorS cience) = 4 = jk ~ 0.13 Problem Set 2.1 Exercises For problems 1-5, express your answer both as a fraction (reduce if possible) and as a decimal to the nearest hundredth. 1) Suppose a single card is dealt from a standard deck of 52 cards. Find the probability that the card is: a) a red card. b) a face card. 31 www.ckl2.org cj an ace. d) a three. e) a club. f) the three of clubs. g) a black king, h) not a spade. 2) A bag contains some jelly beans. There are a total of 6 red jelly beans, 4 green jelly beans, 2 black jelly beans, 5 yellow jelly beans, and 3 orange jelly beans in the bag. Suppose one jelly bean is drawn from the bag. a) Find P (purple). b) Find P(yellow). c) Find P(~red). 3) The game Scattegories uses a 20-sided die. It has all the letters of the alphabet on it except Q, U, V, X, Y, and Z. Find each probability below if the die is rolled one time. a) P (Vowel) b) PfVowel) c) P(Q) d) P(Q C ) e) P(a letter alphabetically after Q) 4) A single 6-sided die is rolled one time. Find the probability that the result is: a) a three www.ckl2.org 32 b) a seven c) an even number d) a prime number e) a number equal to or greater than 5. 5) The month of October in a 2011 calendar has 31 days with October 1st being a Saturday. Suppose a day is randomly selected. Find each probability. a) P (weekend) b) P(not a weekend) c) P(October 31st) d) P(October 32nd) e) P(~October 31st) f) P(an odd-numbered day) UC 1 U Or di U 1 1 printabtecalendars. resources2u. com Sunday Monday Tuesday ledaeeday Tlmrsday Friday Saturday 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 6) A roulette wheel contains 38 slots. When the wheel is spun, a ball is dropped onto the wheel and the ball will stop on one of the slots. There are 18 black slots, 18 red slots, and 2 green slots. Suppose the ball on a roulette wheel has landed on red four times in a row. What is the chance that the ball will drop on red on the next spin? 33 www.ckl2.org 7) A coin has been flipped 10 times. Suppose that it has come up heads on only 2 out of those ten times, a) What percent of the time has the coin come up heads? b) Suppose we flip the coin 90 more times and 45 of those 90 flips come up heads. Of the 100 flips completed so far, what percent of the time has the coin come up heads? c) Suppose we continue to flip the coin an additional 900 times and that 450 of those 900 flips come up heads. Of the 1000 flips completed, what percent of the time has the coin come up heads? d) As we flipped the coin more and more, the percentage of heads got closer and closer to 50% despite the fact that only 2 of the first 10 flips were heads. What rule does this illustrate? 8) Two 6-sided dice are rolled and we keep track of the total on the two dice. a) Make a 6 by 6 grid showing the different totals that you can get when rolling the two dice. b) What is the probability that you get doubles? c) What is the probability that you get a total of 7? d) What is the probability that you get a total of at least 8? www.ckl2.org 34 9) The high school concert choir has 7 boys and 15 girls. The teacher needs to pick three soloists for the next concert but all of the members are so good she decides to randomly select the three students for the solos. a) In how many ways can the teacher select the 3 students? b) What is the probability that all three students selected are girls? c) What is the probability that at least one boy is selected? 10) A test begins with 5 multiple choice questions with four options on each question. It then has 5 true/false questions. a) How many answer keys are possible? b) What is the probability of getting every question correct if a student guesses on each question. Leave your answer as a fraction. 11) A lawn and garden store is moving locations and needs to move its riding lawn mowers to the new store. They have 8 mowers with 36-inch decks, 15 mowers with 42-inch decks, and 6 mowers with 48-inch decks that need to be moved. The trailer they are using can move a total of 8 mowers on each load so several trips will have to be made. a) In how many ways can 8 mowers be randomly selected for the first load? b) What is the probability that all the mowers with 48-inch decks get selected for the first load? Leave your answer as a reduced fraction. c) What is the probability that the first load has exactly two 36- inch deck mowers, four 42-inch deck mowers, and two 48-inch deck mowers? 35 www.ckl2.org Review Exercises 12) In how many ways can three students be selected for a committee if there are 11 students from which to select? 13) A hockey player needs new skates, a new helmet, and a new stick. Hockey Central has 5 brands of skates, 6 brands of helmets, and 8 brands of sticks. In how many different ways can the player select one of each item? 14) Two standard 6-sided dice are rolled and the results from the two dice are added together. Build a grid to determine which outcome is most likely to occur. 15) On a TV game show, three contestants must each pick a box which they believe contains the grand prize based upon clues given about each box. In how many different ways can this be done if there are 10 boxes from which to choose? www.ckl2.org 36 2.2 Compound and Independent Events Understand how to perform the calculations for compound events Compute probabilities for situations with and without replacement Understand when two events are independent Understand how to compute the probability when two independent events occur From section 2.1, you found that it is quite straightforward to calculate probabilities for simple situations. What happens when we calculate probabilities from multiple events? For example, suppose you roll a single die and then flip a coin. What are the chances that the die comes up with a 5 and the coin gives you a heads? A situation that asks you to calculate probabilities for a situation that involves two or more events or steps is called a compound event. We will try to find out how to handle these types of situations by examining several situations and then making a conclusion. Example 1 Suppose a single die is rolled and a coin is flipped. What is the probability that the die comes up with a 5 and the coin gives you a heads? Use a list to help you find out. Solution Start with a list of all the possible outcomes. 1H, IT, 2H, 2T, 3H, 3T, 4H, 4T, 5H, 5T, 6H, 6T. There are 12 equally-likely outcomes. Of these, only 5H is a five with a heads. Therefore, the answer is j^- What you might have noticed is that P(five)=g and P(heads)=^. Curiously, g x | = j_. Is this just a coincidence or is there something more here? You might recognize that flipping a coin does not effect what you roll on the die. When two events do not have an impact on each other, the events are called independent. Consider the two situations below. Situation 1: Suppose your teacher picks students to do problems on the board. After each student does their problem, the teacher gives the student a piece of candy. Because your teacher wants make sure that every student gets a chance to do a problem and get a piece of candy, she keeps track of who has worked problems on the board. The selection of the next student is not independent of previous selections the teacher has made. 37 www.ckl2.org Situation 2: Does a coin have a memory? As far as we can tell, the answer is no. This suggests then that a coin does not pay attention to whether it came up heads or tails. It does not want to make sure that the same number of heads come up as tails. Even if it comes up heads many times in a row, the next flip of that coin is not influenced whatsoever by the previous flips. Successive coin flips are independent of one another. Example 2 Decide which pairs of events below are independent. i) Two cards are dealt, one after the other, from a standard deck of 52 cards. ii) A spinner with three colors is spun twice. iii) A single die is rolled and a coin is flipped. iv) You play on the school baseball team and you win a carnival game by throwing a baseball to try to break a plate. Solution Situations ii) and iii) represent pairs of independent events. The result from the first spin of the spinner does not affect the result of the second spin of the spinner. The result of the roll does not impact what happens when the coin is flipped. Situations i) and iv) are not independent. Once the first card from the deck is dealt, the probabilities for what the second card might be will change. For example, if the first card was the ace of spades, it is impossible for the second card to also be the ace of spades. Being a baseball player makes it more likely mean that you are accurate and can throw a ball harder than a typical person and you therefore would be more likely to break a plate. Let's do another example involving calculations and investigate if multiplying probabilities in a situation involving independent events gives us the correct result. Example 3 A coin is flipped three times in a row. What is the probability that all three flips result in heads? Find your answer by either using a tree diagram or by making a list. www.ckl2.org 38 Solution To be organized, we can make an alphabetically list. HHH, HHT, HTH, HTT, THH, THT, TTH, TTT. There are 8 outcomes altogether and only one of those is HHH. P(HHH)=g. The probability of heads on one flip is i. As in example 1, we have P(HHH) = 2 x^x^ = i. Note that the three coin flips are independent of each other. In both Example 1 and Example 3, we could multiply the probabilities of each individual event to get the probability of both events happening. This is always true of independent events. In other words, suppose we want the probability of both some outcome 'A' from one event and some outcome 'B' from a second event that is independent of the first event. If the probability of our first outcome is P(A) and the probability of our second outcome is P(B), then the probability of both A and B happening is P(A and B)=P(A)xP(B). For Independent Events P(A&B) = P(A)P(B) Example 4 You are dealt one card from each of two separate decks of cards. a) What is the probability that both cards are the king of clubs? b) What is the probability that the two cards are identical? Solution In both situations, the events are independent. a) We have P(K* k K*)=^ X ^ = ^ki ~ °- 00037 - b) There are two good ways to solve this problem. We could imagine that we do what we did in part a) for all 52 cards in the deck. We simply could multiply our answer for part a) by 52 to get g2 * 0.019. Another way to think about this would be to ask about each card separately. What is the chance that the first card could be useful in making a match? (100%) What is the chance that the second card will be useful in making a match? If we already have one card picked, the chance that the card from the second deck will match it is gg. 1 x F2 = ;T2 * 0.019. Multiplication of probabilities expands to more than just two independent events. It also works with three or more independent events and it even works with many situations that do not have independent events. In general, when finding the probability of compound events, multiply the probabilities of each 39 www.ckl2.org individual event. If we are interested in the probability of events A, B, and C happening, we can multiply P{A)xP{B)xP{C). This same principle also works with compound events in which we distinguish whether or not we have replacement. Suppose we are asked to pick two cards out of a deck. If we are asked to do this without replacement, we will select the first card and record what it is. When we select our second card, we must remember that the deck has changed. Nonetheless, we can find the probability of drawing these two particular cards by multiplying the individual probabilities. Example 5 Suppose you have a set of pool balls in a bag. You pull two pool balls out of the bag, one after the other, without replacement. What is the probability that both pool balls are striped? Solution There are 7 striped pool balls out of the 15 pool balls. The chance that the first pool ball is striped is j~. Since we are not going to replace the first pool ball, what is in the bag has now changed. There are only 14 pool balls left of which 6 are striped since the first one removed from the bag was also striped. The chance that the second pool ball is striped is y|. To find the probability that both pool balls are striped, we multiply the individual probabilities. This gives j§ x jg = 2jo = 5 = 0.2. There is a 20% chance that both balls will be striped if we use replacement. Example 6 Suppose you have a set of pool balls in a bag. You pull two pool balls out of the bag, one after the other, with replacement. (This means that after you record what the first ball is, you put it back into the bag and remix the pool balls before you select the second pool ball.) What is the probability that both pool balls are striped? www.ckl2.org 40 Solution There are 7 striped pool balls out of the 15 pool balls. The chance that the first pool ball is striped is j^. Since we are going to replace the first pool ball, what is in the bag has not changed. There are still 15 pool balls of which 7 still are striped. Therefore, the chance that the second pool ball is also striped is X. To find the probability that both pool balls are striped, we multiply the individual probabilities to get jg x ^ = ^ « 0.22. There is approximately a 22% chance that both balls will be striped if we do not use replacement. We also run into situations where we are dealing with compound events involving very large populations. In these sorts of situations, we must be careful about how we interpret the mathematics. Example 7 Approximately 20% of all Americans smoke. Suppose two Americans are selected at random. What is the probability that both Americans are smokers? Solution The chance that the first person is a smoker is 20%. Some students think that the chance that the second person is a smoker changes after the first person is selected, however, it does not. The population of America is so large that selecting a single person out from that population will not affect the overall percentage of Americans that smoke. The probability that the second person smokes is also 20%. P(2 smokers selected)=0.2 x 0.2 = 0.04 = 4%. Example 8 Approximately 20% of all Americans smoke. Suppose five Americans are selected at random. What is the probability that all five are non-smokers? Solution Since 20% of Americans are smokers, 80% must be non-smokers. This gives us 0.8 x 0.8 x 0.8 x 0.8 x 0.8 = (0.8) 5 ~ 0.33. The chance that all five Americans selected will be non-smokers is about 33%. Problem Set 2.2 Exercises 1) What does it mean for two events to be independent? 2) Suppose you are dealt one card each from two separate decks of cards. What is the probability that both of your cards are: a) red? 41 www.ckl2.org b) spades? c) jacks? d) face cards? 3) For each situation below, determine whether the two events are independent. a) Flip a coin and then draw a card from a standard deck of 52 cards. b) Draw a marble from a bag, do not replace it, and then draw a 2 nd marble from the same bag. c) Get a raise at work and purchase a new car. d) Drive on ice and lose control of your car. e) Have a large shoe size and have a high IQ. f) Be a chain smoker and get lung cancer. g) Dad is left handed and son is left handed. 4) A spinner with three equal spaces of red, blue, and green is spun one time. A single six-sided die is rolled once. What is the probability that you get blue and a number greater than 3? 5) Suppose you are dealt two cards, one after another from a standard deck of cards. What is the probability that both of your cards are: a) spades? b) the same suit? c) kings? 6) Three cards are drawn from a standard deck without replacement. Find the probability that: a) all are jacks. www.ckl2.org 42 b) all are clubs. c) all are red cards. 7) In a carnival game, players are given three darts and throw them at a set of balloons on a wall. Suppose there are eight balloons on the wall. Five of the eight balloons have slips of paper in them that say 'Winner' while three of the eight balloons have slips of paper that are blank. Suppose you pop a balloon with each of your three darts. If all three balloons have 'Winner' slips, you win the grand prize. If all three balloons have blank slips, you win the consolation prize. What is the probability that: a) you win the grand prize? b) you win the consolation prize? 8) A classroom contains 12 males and 18 females. Two different students will be randomly selected to give speeches. What is the probability that the two students who give speeches are: a) two females? b) two males? c) 1 male and 1 female (in either order)? (Hint: Use your answers from a) and b) along with some subtraction.) 9) If 18% of all Americans are underweight, find the probability that two randomly selected Americans will both be underweight. 10) A survey found that 68% of book buyers are 40 years old or older. If two book buyers are selected at random, what is the probability that both are 40 years old or older? 11) The Gallup Poll reported that 82% of Americans used a seat belt the last time they got into a car. If four people are selected at random, find the probability that they all used a seat belt the last time they got into a car. 12) Eighty-three percent of diners favor the practice of tipping to reward good service. If three restaurant customers are selected at random, what is the probability that all three are in favor of tipping? 13) Suppose that 25% of U.S. federal prisoners are not U.S. citizens. 43 www.ckl2.org a) Find the probability that a randomly selected federal prisoner is a U.S. citizen. b) Find the probability that three randomly selected prisoners are all U.S. citizens. 14) At a local university, 70% of all incoming freshmen have computers. If three students are selected at random, what is the probability that: a) none have computers? b) all three have computers? 15) The U.S. Department of Justice states that 6% of all murders occur without weapons. If three murder cases are selected at random, what is the probability that all three occurred with the use of a weapon? Review Exercises 16) Which of the following are random events? i) You need to pick 2 people to be your partners in a group project so you select two of your friends. ii) You make a rock skip across the surface of a lake 12 times. iii) A baby elephant is born and it is a boy. iv) You spin the big wheel on the TV game show "The Price is Right" and you win $1000. 17) In how many ways can two 12-graders be selected for speaking at graduation if there are 16 seniors that apply? One speaker will give a short introductory speech and one will give a longer speech that reflects upon the experiences of this particular senior class. www.ckl2.org 44 18) A family of 4 just won the lottery and goes to an auto dealership to purchase a new vehicle for each member of the family. The parents each decide that they want a car while their kids decide they would each like a truck. In how many ways can their purchase their 4 vehicles if the dealership has 17 cars and 23 trucks available? 19) The Strikers and the Kicks soccer teams are playing a best of five playoff series. The first team to win three games is the winner. Draw a tree diagram to show the different ways the series might play out. 2.3 Mutually Exclusive Outcomes Learning Objectives • Understand when two outcomes are mutually exclusive • Understand the concepts of unions and intersections • Be able to compute probabilities using Venn Diagrams and formulas Sometimes there are situations in which two different outcomes cannot occur at the same time. For example, if you roll a single die one time and you wish to find the probability of getting an even number and a 3 on that one roll. These two outcomes cannot occur at the same time. When it is impossible for two outcomes to occur at the same time, we say the outcomes are mutually exclusive or disjoint. If outcomes 'A' and 'B' are mutually exclusive then it is impossible for outcome 'A' and 'B' to happen at the same time, or P(A and B)=0. However, if 'A' and 'B' are mutually exclusive then P(A or B)=P(A)+P(B). Using proper notation we have P(A U B) = P(A) + P(B). Remember, this is only true if the two outcomes are mutually exclusive. For Mutually Exclusive Events P(AuB) = P(A) + P(B) P(AorB) = P(A) + P(B) The U can be read as the union of outcomes 'A' and 'B'. The probability for the union of two outcomes 'A' and 'B' can be thought of as the chance that either 'A' occurs, 'B' occurs, or both 'A' and 'B' occur. Example 1 A single die is rolled one time. What is the probability of getting either an odd number or a 6? Solution The outcomes of getting an odd number and getting a 6 are mutually exclusive since they 3,1 4 2 6 "T" 6 ~~ 6 ~~ 3 cannot occur at the same time. P(Odd U 6) = P(Odd) + P(6) = 1 + 6 = 6 = 3 S:: ^-67 Of course, not every situation involves mutually exclusive events. The diagrams above are Venn Diagrams. They are useful in showing us how different outcomes are related. Outcomes 'A' and 'B' are not mutually exclusive if they overlap. We say that there is an 45 www.ckl2.org Outcomes A and B are Mutually Exclusive. The outcomes do not intersect. A& Bare not Mutually Exclusive. The outcomes intersect. Figure 2.1 intersection of outcomes 'A' and 'B' if they overlap. For example, if we roll a single 6-sided die, the outcomes of getting an odd number and getting a number bigger than 3 intersect. They both include the number 5. The symbol for an intersection is Pi. A logical extension of the formula for the union of two outcomes that are not mutually exclusive is P{A U B) — P(A) + P{B) — P{A n B). The next two examples illustrate this formula. For Non-Mutually Exclusive Events P(A^>B) = P(A) + P(B)-P(A^B) P(A or B) = P(A)+P(B)-P(A and B) *** Example 2 A single 6-sided die is rolled. Suppose the outcomes we are interested in are getting an odd number and getting a prime number (2,3, or 5). Draw a Venn Diagram for this situation. Solution Notice that since the numbers 3 & 5 belong to both the odd numbers and the prime numbers, they are placed into the intersection of the 'Odds' circle and the 'Primes' circle. Notice also that the numbers 4 & 6 do not belong to either set and are placed outside both circles. www.ckl2.org 46 Example 3 A single 6-sided die is rolled. What is the probability of getting either an odd number or a prime number? Note that this is the same as asking for P{Odd U Prime). Solution Using the figure from Example 1 we see that there are four values out of six that are either odd or prime. Therefore, P(Odd or Prime)=g = | ~ 0.67. If we use the formula, P(OddU Prime) = P(Odd) + P(Prime) - P(Odd n Prime). This gives §+§-§=1=1* 0.67. Example 4 Suppose there is a 60% chance it will rain today and that there is a 70% chance that it will be over 90°F. Suppose also that there is a 45% chance that it will both rain and be above 90 degrees. What is the chance that it will neither rain nor be above 90 degrees? Solve using both a Venn Diagram and using a formula. Solution Start by drawing a Venn Diagram and noting that we have two circles, one for rain and one temperature. These two outcomes are not mutually exclusive because they can occur at the same time therefore the circles should overlap. We can quickly fill in the intersection of the two outcomes as 45%. If we remember that there is a 60% chance of rain and we already have 45% filled in for the rain circle, the remainder of the rain circle must be 15% so it adds up to 60%. Likewise, the remainder of the >90° circle must be 25%. 47 www.ckl2.org 1 >90 Rain Both 1 Degrees J Neither \ 15W \ 45% / 25% We can now see that we have a total of 15%+45%+25%=85%. This means that the 'Neither' category must be 15% to give us a total of 100%. Using the formula, P(RainU90) = P(Rain)+P(90)-P(RainD90). Filling in we get P(RainU90) 60% + 70% - 45% = 85%. 100%-85%=15%. Example 5 Which pairs of outcomes are mutually exclusive? a) You go to the pet store to buy a pet. Outcome A = You buy a pet that flies, Outcome B = You buy a pet that has no legs. b) You order a pizza. Outcome A = Your pizza has pepperoni on it, Outcome B = Your pizza has mushrooms. c) You select a football player to take a picture of for the yearbook. Outcome A = The player is a 4-year varsity starter, Outcome B = The player is 14 years old. d) Radio stations have 4-letter station names such as KDWB. You decide to pick a radio station to listen to. Outcome A = The station's 4-letter name starts with a W, Outcome B = The station's 4-letter name contains three E's. Solution a) is mutually excluse. You cannot buy a pet that both flies and also has no legs. b) is not mutually exclusive. You can order a pizza with both mushrooms and pepperoni. c) is mutually exclusive. If a 4-year varsity starter is 14 years old, then they would have been a varsity starter when they were 10 years old. That simply does not happen. d) is mutually exclusive. These both could happen if the radio station's call sign was WEEE. www.ckl2.org 48 Example 6 The Rockin' Rollers performance company books shows for their musicians. Currently, they have 30 musicians who play either bass guitar, lead guitar, or rhythm guitar. Some of these musicians play more than one instrument. Suppose 4 musicians can play lead, rhythm, or bass guitar. Fourteen can play lead or rhythm but not bass, two can play bass or rhythm but not lead, 3 can play bass only, and 4 can play rhythm only. There are no musicians who play lead and bass only. Draw a Venn Diagram to determine how many musicians play lead only. Solution From the information given, we can fill in a Venn Diagram as shown below. 49 www.ckl2.org Notice that we have used up 14 + 4 + 4 + 2 + 3 = 27 of the 30 musicians so that means there must be 3 musicians who play lead guitar only. Problem Set 2.3 Exercises 1) What does it mean for two outcomes to be mutually exclusive? 2) Give an example of two outcomes from a situation in which the two outcomes are mutually exclusive. 3) Give an example of two outcomes from a situation that are not mutually exclusive. 4) Consider each event. Decide whether each pair of outcomes are mutually exclusive. a) Roll a die: Get an even number and get a number less than 3. b) Roll a die: Get a prime number (2, 3, or 5) and get a six. c) Roll a die: Get a number greater than 3 and get a number less than 3. d) Select a student: Get a student with blue eyes and get a student with blond hair. e) Select a college student: Get a sophomore and get a student that is a math major. f) Select a course: Get an Algebra course and get an English course. g) Select a voter: Get a Republican and get a Democrat. 5) There are 200 male students at a particular school. Of these, 58 play football, 40 play basketball, and 8 play both. a) Draw and label a Venn diagram for this situation. b) How many play both sports. c) How many play basketball but not football? d) How many play football but not basketball? e) How many do not play football or basketball? 6) A single card from a standard deck can have many descriptions. For example, the King of Spades could be described as a black card, a face card, a king, or a spade. Suppose we pull a single card out of a deck and we pay attention to the outcomes of getting a red card, getting a jack, and getting a spade. a) Draw a Venn diagram to illustrate this situation paying attention to whether or not it is red, a jack, or a spade. www.ckl2.org 50 b) Shade the portion of your diagram with vertical lines that represents the intersection of getting a red card and getting a jack. c) Shade the portion of the diagram with horizontal lines that represents the union of getting a jack or getting a spade. 7) An architectural firm is putting out bids to design two large governmental buildings. Suppose they believe they have 35% chance of getting the contract for the first building, an 80% chance of getting the contract for the second building and a 10% chance of getting neither job. a) Draw a Venn Diagram for this situation and use your diagram to find the chance that they get both contracts. b) Use a formula for this situation to find the chance that they get both contracts. 8) A student tells their teacher that they want to build a cabinet in woodshop. Students sometimes build this project with oak only, sometimes with cherry only, sometimes with both and sometimes with neither. There is a 40% chance the project will be built using oak, a 50% chance the project will be built using cherry, and a 30% chance that the project will be built using both types of wood. What is the chance that the student will not use either oak or cherry? 9) Consider a set of 15 pool balls. Balls numbered 1 through 8 are solid and balls 9 through 15 are striped. Suppose the balls are placed into a bag and one ball is randomly selected. Find the probability that: a) you selected either a solid ball or a ball numbered greater than 12? b) you selected an even numbered ball or a solid ball? c) you selected a solid ball or a striped ball? d) you selected a ball that was striped and even? 10) Suppose you again have a standard set of 15 pool balls. This time, you pull two pool balls out of the bag, replacing the first ball before you select the second ball. What is the probability that: a) your two pool balls are both solid? 51 www.ckl2.org b) you pick exactly the same ball twice? c) your first ball was solid and your second was odd? 11) At a particular school, there are 20 teachers. Three of them teach math, 5 teach science, and 3 teach computer science. It turns out that there is one teacher who teaches all three classes and one teacher who teaches both science and computer science. Draw a Venn diagram to illustrate the situation and determine how many of the 20 teachers teach courses other than math, science, or computer science. Hint: You will need 3 circles to build this diagram. Review Exercises 12) Two cards are selected from a standard deck of 52 cards, one after the other without replacement. What is the probability that the two cards are both face cards? 13) Suppose 90% of all Americans have attended a religious ceremony at least one time in the past year. What is the probability that 4 randomly selected Americans will all have attended at least one religious ceremony in the past year? Uh II 1 14) A single 6-sided die is rolled once and a single card is drawn from a standard deck of 52 cards. What is the probability that the die shows a result greater than 3 and the card is a heart? 15) A young girl has a box of 8 color crayons but has decided they need only 3 colors to make a picture for her grandfather. In how many ways can the child select the three crayons? 16) In how many ways can a committee of 4 people be selected if there must be at least 1 man and 1 women on the committee and there are 6 men and 7 women from which to pick? 2.4 Tree Diagrams and Probability Models Learning Objectives • Understand how to build and properly notate a tree diagram • Understand how to calculate probabilities using a tree diagram • Understand how to verify if a tree diagram is correct www.ckl2.org 52 Be able to build a probability model by using a tree diagram As we advance through probability, it becomes very apparent that we need to be quite organized with our problems as they become more complex. In this section we will use tree diagrams to help us calculate probabilities for given situations. Tree diagrams are a visual aid that can help us break down a situation and calculate probabilities. There are two key principles that we must observe for all tree diagrams. First of all, to find the total probability for any given branch on a tree, multiply the individual probabilities along that branch. Secondly, the sum of the probabilities from the ends of each branch must total to 1. We will examine several examples of probabilities using tree diagrams in order to solidify our understanding of this concept. Example 1 At a restaurant, there are two breakfast platters that are served, one featuring pancakes and one featuring eggs. There are also two choices for drinks, milk or juice. Thirty percent of customers choose the pancake platter while 70 percent choose the egg platter. Forty percent of customers choose milk while 60 percent choose juice. Assume the drink choice is independent of platter choice. Build a probability model for this situation by using a tree diagram. Solution Step one is to build the tree diagram as shown below. Be sure to label each branch with what it represents and the associated probability. Step two is to calculate the probabilities at the end of each branch. To do this, we multiply the probabilities along each branch. For example, the top branch's value of 0.28 was found by multiplying 0.7 by 0.4. 53 www.ckl2.org P(MNk)= P(Eggs}=.7 .28 .42 .12 P(Pan.) .18 Our probability model, shown below, summarizes the data in the tree diagram. There are two critical ideas to pay attention to here. First of all, the probability at the end of each branch is the product of the probabilities on that branch. Secondly, notice that the sum of the four probabilities at the ends of the branches add up to 1. The table below that summarizes the results from our tree diagram and is called a probability model. Notice that the probabilities in the table sum to 1. Table 2.1: Order Probability Eggs & Milk 0.28 Eggs & Juice 0.42 Pancakes & Milk 0.12 Pancakes & Juice 0.18 Example 2 The Diamonds and the Dusters baseball teams are going to play a best-of-three playoff series. The first team to win two games is the winner of the series. Based on previous performance this season, the Diamonds have a 60% chance to win any game they play against the Dusters. Build a tree diagram and then build a probability model to determine the probability of each team winning the series. www.ckl2.org 54 Solution Notice that in the tree diagram below, not all branches go three games. There are two sets of branches that only go two games. A third game does not need to be played in all situations. 36 (Diamonds Win} P(Diam.)=.6 P(Diam.)=.6 P(Dust.)=.4 .144 (Diamonds Win) .096 (Dusters Win) .144 (Diamonds Win) P(Dust.}=.4 P(Dust.)=.4 ^ .096 (Dusters Win) .16 (Dusters Win) We can now make a probability model for this situation by adding up the probabilities for the Diamonds winning in two or three games and the Dusters winning in two or three games. This is shown in the probability table below. The probability of the Diamonds winning the series is .36+.144+.144=.648. The probability of the Dusters winning the series is .096+.096+.16=.352. In our solution, we were careful to make sure each branch was labeled and included a probability. Once again, we multiplied the values along each branch to get the probability at the end of the branch. Notice again that the total of all the probabilities in the probability model adds to 1. Table 2.2: Outcome Probability Diamonds Win in 2 games 0.36 Dusters Win in 2 games 0.16 Diamonds win in 3 games 0.288 Dusters win in 3 games 0.192 55 www.ckl2.org Example 3 You are dealt two cards from a standard deck of 52 cards. What is the probability that the two cards can be classified as a red card and a face card (in either order)? Solution Begin by considering the first card. We are concerned primarily with getting a face card and then a red card or a red card and then a face card. Let's start by thinking about what we would need on our first card to have a chance to reach our goal. The first card could be a red card, a face card, or both a red and a face card. Any other card would result in us not reaching our goal. As a result our tree diagram will have four initial branches as shown below; red (not face), face (not red), red & face, or other. 20 P(R,~F}= — 52 P(F)= 12 51 Pf~Fl _39 --41 my- 26 240 2652 156 2652 31 P(RorF)= — 51 186 P(Other)=— \ \. P(~R,~F)= — 52 \ \ 51 2652 Once our first set of branches are complete, we look at the second stage. We will examine the red, not face branch in detail (top branch). There are 20 red cards out of our 52 card deck that are not face cards. Once we have that card, we now want to know the chances of getting a face card (assuming we just drew a red card). There are still 12 face cards in the deck but there are on i o 0/i n now only 51 cards remaining in the deck. We multiply this branch out to get 55 X 51 = 26!^ • There are two other branches we work in a similar fashion as they are the only other branches that help us achieve our goal. Adding these gives us ^^ + ^H? + ^^ = ^j^ « 0.22. There is about a 22% chance of getting a red card and a face card. www.ckl2.org 56 The key to success when working with tree diagrams and probability models is to work in a neat and organized fashion. Many errors are often due to sloppy work. In addition, another useful suggestion is to make your tree diagrams large enough so that you have plenty of room to work. Problem Set 2.4 Exercises 1) Fourteen red marbles and sixteen green marbles are in a bag. Two marbles are picked out one at a time and replaced after they are picked. Build a tree diagram and probability model to show the different combinations of marbles that could be pulled out of the bag. 2) A bag contains a standard set of pool balls. Two balls are pulled out, one after another, and not replaced. What is the probability that the two balls are a solid and a striped ball in either order? (Recall that there are 8 solid pool balls and 7 striped pool balls.) 3) A bag contains a $100 bill and two $20 bills. A second bag contains 1 gold marble and 2 silver marbles. You get to pick one bill out of the first bag. After this, you pick a marble out of the second bag. If you get the gold marble, you get to triple the amount of money you pulled from the first bag. If you get a silver marble, you get to double the amount of money you picked from the first bag. Build a probability model for all the different amounts of money that you might win. 4) A basketball player is practicing shooting free throws. Suppose she makes 75% of her free throw attempts. Make a tree diagram and probability model for what might happen if she decides to shoot three free throws. In other words, what is the probability that she makes zero shots, one shot, two shots, or all three shots. 5) A coin is flipped and then two dice are rolled. Build a probability model that shows how likely it is to get heads followed by doubles, heads and a non-doubles, tails and doubles, and tails and non-doubles. 6) A spinner with four evenly-spaced wedges of red, blue, green, and orange on it is spun and a coin is flipped. a) How many different outcomes are possible? b) Build a probability model that shows the probabilities for each outcome. 7) A baseball player is a .400 hitter. This means that he gets a hit (single, double, triple, or home run) 40% of the time he has an at-bat. Use a tree diagram to build a probability model that shows the probability of the player having 0, 1, 2, or 3 hits if he has 3 at-bats in one game. 8) In some sports, the home team wins a higher percentage of games played. Suppose the Dunkers and the Hoopsters are playing a best-of-three game series against each other. When the Dunkers are home, they 57 www.ckl2.org have a 60% chance of winning a game against the Hoopsters. When the Hoopsters are home, they have a 55% chance of winning a game against the Dunkers. The Dunkers will be the home team in games 1 and 3 while the Hoopsters will be the home team in game 2. Use a tree diagram to build a probability model for this situation. The model should show the chances that the Dunkers win in 2 games or in 3 games and the chances that the Hoopsters win in 2 games or in 3 games. 9) A patient is scheduled to have two surgeries. The results of each surgery are independent of each other. Suppose the first surgery has a 90% success rate and the second surgery has an 85% success rate. Build a probability model by using a tree diagram that shows all the different results that might occur. 10) A bag contains ten red cubes numbered 1 through 10 and five blue cubes numbered 1 through 5. You pull two cubes out of the bag without replacement. What is the probability that the two cubes will be an odd cube and a red cube (in either order)? Review Exercises 11) Suppose events 'A' and 'B' are mutually exclusive and that P(A)=0.35 and P(B)=0.14. What is P(A U B)l 12) How many unique three-letter 'words' can be formed by selecting three letters from the alphabet if no letter may be repeated? 13) How many unique three-letter 'words' can be formed by selecting three letters from the alphabet if letters may be repeated? 14) 20% of all households in the Twin Cities get the Star Tribune newspaper delivered to their home while only 15% get the Pioneer Press delivered to their home. If 70% of homes do not get either newspaper delivered, what percent of homes get both newspapers delivered? www.ckl2.org 58 2.5 Conditional Probabilities and 2- Way Tables Learning Objectives • Understand how to calculate conditional probabilities • Understand how to calculate probabilities using a contingency or 2-way table It is quite easy to calculate simple probabilities. What is the chance of rolling a 4 with a single die? What is the chance of being dealt a queen from a deck of cards? We are now going to focus on conditional probabilities. A conditional probability is a probability in which a certain prerequisite condition has already been met. We can start by thinking about cards being dealt from a standard deck of 52 cards. As each card is dealt, what remains in the deck changes. A gambler in a casino will pay close attention to cards played. If many face cards have already been dealt, the observant gambler will understand that the next card has a higher chance of not being a face card. Suppose we want to know the probability that our next card will be a face card given that the first card was the 7 of diamonds. The formal notation for this is P(Face|7*). This is read as "The probability of a face card given that we already have been dealt the 7 of diamonds.". Often times the math for these situations is very logical. In our case, we have simply reduced the deck by one card and there are still 12 face cards in the deck. Therefore P(Face|7*)=Fj = jf ~ 0.24. Example 1 Two cards are dealt from a standard deck of 52 cards. Find each conditional probability. a) P(2ndred|lst 2*) b) P(2nd red|lst 2*) c) P(2nd club 1 1st red) Solution a) There are still 26 red cards out of the remaining 51 cards after the two of clubs is dealt, i * 0-51- b) There are only 25 red cards out of the remaining 51 cards after the two of diamonds is dealt. — « 49 51 u -^ y - c) There are still 13 clubs out of the remaining 51 cards after a red card is dealt. f| ~ 0.25. Example 2 In a common poker game, 5 cards are dealt to a player. The best possible hand is called a royal flush. This occurs if a player gets the ten, jack, queen, king, and ace all of the same suit. What is the chance of being dealt a royal flush? Leave your answer as a fraction. 59 www.ckl2.org Solution We will solve this by looking at one card at a time. What is the chance that the first card might be part of a royal flush? Before any cards are dealt, there are four 10's, four jack's, four queen's, four kings, and four aces available. Twenty of the 52 cards can help you on your way to a royal flush. Once you receive this card, what are the chances that the second card will also help on your way to the royal flush? We might answer this by simply imagining us getting a useful card on the first card. Suppose our first card was the jack of spades. There are only 4 other cards of the remaining 51 cards that will help now, the 10, queen, king, and ace of spades. Suppose we get one of those cards, perhaps the king of spades. There are now only 3 cards of the remaining 50 that can help us complete our royal flush. Suppose our third card was the queen of spades. Only 2 of the remaining 49 cards will help our quest for our fourth card. Likewise, there is only one card of the last 48 that can help us on card number five. Putting this all together, we have |§ x ^" x ^) x ^ x ^ = 311 875 200 = 649710' Putting this in perspective, if you dealt 1000 poker hands every single day, it would take nearly two years to deal 649,740 hands, of which we would only expect about one to be a royal flush. Good Luck! Another way we can look at conditional probabilities is through the use of two-way tables or contingency tables. These are often referred to as two-way tables because there are two distinct pieces of information gathered in these tables. For example, we may record how many siblings you have and in how many activities you participate in school. Two-way tables can be filled in either using counts or probabilities. We will start by answering simple questions such as "What is the probability that a student participates in exactly 2 activities?". After we understand how to work with these tables, we will begin asking more complex questions such as "What is the chance a student participates in 3 activities given that they have 1 sibling?". Let's begin with an easy example to help us understand how to read these tables. Example 3 Suppose we survey all the students at school and ask them how they get to school and also what grade they are in. The chart below gives the results. Suppose we randomly select one student. www.ckl2.org 60 Bub Walk Car Other 9* or 10 th grade 106 30 70 4 11* or 12 th grade 41 58 184 7 a) Give all the row and column totals. b) What is the probability that the student walked to school? c) What is the probability that the student was a 9th or 10th grader? d) What is the probability that a student either rode the bus or is in 11th or 12th grade? Solution a) Bub Walk Car Other Total 9* or 10* grade 106 30 70 4 210 ll 1 * or 12 th grade 41 SB 1S4 7 290 Total 147 88 254 11 500 b) There were 88 walkers out of 500 total students or ^ = ^ « 0.18. c) There were 210 9th or 10th graders out of 500 total students or fgg = fg = 0.42. d) There are 147 kids who rode the bus and there are 290 kids who are 11th or 12th graders. However, notice that these two categories intersect and we must be careful not to count the 41 kids who are in both categories twice. We will take the 290 11th or 12th graders and just add the 106 bus riders who are not 11th and 12th graders for a total of 396 students. The probability of selecting an 11th or 12th grader or a bus rider is |gg = ^ ~ 0.79. In the example above, note that the total across the bottom, 147+88+254+11, and the total for the last column, 210+290, both add up to 500. This is true of all 2-way tables. Now that we have the basic ideas down in a contingency table, let's move to a couple of more challenging questions. Example 4 Consider the completed chart in the solution of part a) of Example 3. 61 www.ckl2.org a) What is the probability that a student is in 11th or 12th grade given that they rode in a car to school? b) What is P(Walk|9th or 10th grade)? Solution a) The trick to dealing with conditional probabilities in two-way tables is to make sure that you only use what you are given. We are given that they rode in a car to school. We will only look at the Car column. We first note that there were a total of 254 kids who rode in a car to school. We then see that 184 of these kids were 11th and 12th graders. This gives us 234 = jff ~ -72. b) We want the probability that a student walked to school given that they were in 9th or 10th. We will only look only at the 9th and 10th grade row. There are 210 students who are 9th and 10th graders. Of these, only 30 walked to school. This gives us ML = i » .14. Example 5 The manager of an ice cream shop is curious as to which customers are buying certain flavors of ice cream. He decides to track whether the customer is an adult or a child and whether they order vanilla ice cream or chocolate ice cream. He finds that of his 224 customers in one week that 146 ordered chocolate. He also finds that 52 of his 93 adult customers ordered vanilla. Build a contingency table that tracks the type of customer and type of ice cream. Solution Start by filling in the values we are given and then work from there. The table below shows what we are given in the initial problem. Adult Child Total Vanilla 52 Chocolate 146 Total 93 224 Our next step is to fill in the Adult/Chocolate space with 41, the Child/Total box with 131, and the Total/Vanilla box with 78 by using subtraction. It is now easy to fill in the remaining www.ckl2.org 62 boxes. For example, we can quickly determine that the Child/Chocolate box must be 146 - 41 = 105. You can verify that this table is correct by checking each row and column total. Adult Child Total Vanilla 52 26 78 Chocolate 41 105 146 Total 93 131 224 Example 6 A survey asked students which types of music they listen to? Out of 200 students, 75 indicated pop music and 45 indicated country music with 22 of these students indicating they listened to both. Use a Venn Diagram to find the probability that a randomly selected student listens to pop music given that they listen country music. Solution Consider our Venn Diagram below. First fill in the both section with 22 students. We can now logically deduce how many students are left to fill up the Pop circle and the Country circle. Since we are given that they listen to country music we may only use the information that is in the Country circle. There are only 45 students that landed in this circle. Of the 45 students who listen to country music, 22 of them also listen to pop music or ^ w .49. 63 www.ckl2.org Problem Set 2.5 Exercises 1) Figure 2.2 shows the counts of earned degrees for several colleges on the East Coast. The level of degree and the gender of the degree recipient were tracked. Row & Column totals are included. Bachelor's Master's Professional Doctorate Total Female 542 128 26 18 714 Male 438 165 38 20 661 Total 980 293 64 38 1375 Figure 2.2 a) What is the probability that a randomly selected degree recipient is a female? b) What is the probability that a randomly chosen degree recipient is a man? c) What is the probability that a randomly selected degree recipient is a woman, given that they received a Master's Degree? d) For a randomly selected degree recipient, what is P(Bachelor's Degree) Male)? 2) In poker, 5 cards are dealt to a player. One of the stronger poker hands is a flush. This means that all 5 cards are of the same suit, for example, all hearts. What is the probability of being dealt a flush? 3) The table below shows the probability breakdown of ages and genders for the typical American college student. Each value in the table is given as a probability. For example, there is a 12% chance that a randomly selected college student will be a male between 25 and 34 years old. Table 2.3: Male Female 14-17 .01 .01 18-24 .30 .30 25-34 .12 .13 >34 .04 .09 a) What is the probability that a randomly selected American college student is female? b) What is the probability that a randomly selected American college student is female given that the student is more than 34 years old? c) What is the probability that a randomly selected college student is either a female or more than 34 years old? www.ckl2.org 64 4) Suppose that 40% of adults like eating bananas while 60% like eating apples. Suppose also that 32% of adults like eating both. What is the conditional probability that a randomly selected adult likes apples given that they like bananas? Use a Venn Diagram to answer this question. 5) Another good poker hand is called a straight. This means that your five cards will be numerically in order such as an 8, 9, 10, jack, and queen. The cards do not need to match suit in a straight. Suppose you receive the first four cards of a five card poker hand. You have 5V, 7*, 8*, and 9*. What is the probability that the next card will give you a straight? 6) Suppose you receive the first four cards of a five card poker hand. You have 3v, 4*, 5*, and 6*. What is the probability that your next card will give you a straight? 7) A statistics class has 18 juniors and 10 seniors in it. 6 of the seniors are females and 12 of the juniors are males. Build a contingency table to find the probability that a randomly selected student is: a) a junior or a female? b) a senior or a female? c) a junior or a senior? d) a female given that the student was a senior? 8) At a used-book sale, there are 120 children's books and 80 adult books available. 50 of the adult books are nonfiction while 40 of the children's books are nonfiction. All other books are fiction. Build a contingency table to find the probability that a randomly selected book is: a) fiction. b) not a children's nonfiction c) an adult book or children's nonfiction. 65 www.ckl2.org d) a children's book given that it was nonfiction. 9) Cable channels 6, 8, & 10 show quiz shows, comedies, & dramas. The table below shows the distributions of these shows. Table 2.4: Quiz Show Comedy Drama Channel 6 4 3 4 Channel 2 3 5 Channel 10 1 If a show is selected at random, find the probability that the show is: a) a quiz show or shown on Channel 8. b) a drama or a comedy. c) a comedy given that it is shown on Channel 8. d) shown on Channel 6 given that it is a drama. 10) Animals on the endangered species list are given in the table below by type of animal and whether it is domestic or foreign to the United States. Table 2.5: United States Foreign Mammals 63 251 Birds 78 175 Reptiles 14 64 Amphibians 10 8 An endangered animal is selected at random. What is the probability that it is: a) a bird found in the United States? b) foreign or a mammal? c) a bird given that it is found in the United States? d) a bird given that it is foreign? www.ckl2.org 66 11) Suppose a standard set of pool balls (1-8 are solid and 9-15 are striped) are in a bag. Two balls are picked out of the bag without replacement. a) Find the probability that second ball is striped given that the first ball was the 10 ball. b) Find P(2nd striped] 1st solid). 12) Suppose you receive your first three cards of a five card poker hand. You have 5*, 6*, 7*. What is the probability that your next two cards will result with you having a straight? Review Exercises 13) Suppose that 25% of all 9th graders have an unweighted GPA after their 9th grade year of 3.5 or higher. Also suppose that 60% of all 9th graders are involved in a sport at some time during their 9th grade year. Assume that a student's GPA and whether or not they are in a sport are independent of one another. Draw and label a tree diagram for this situation and build a probability model that summarizes the different probabilities possible for 9th grade students in regards to GPA and a sport. 14) A special deck of cards contains only the face cards and aces from a standard deck of cards. a) If one card is dealt, what is the probability that the card is an ace? b) If one card is dealt, what is the probability that the card is a black ace? c) If two cards are dealt, what is the probability that both cards are face cards? 15) Suppose for a moment that all months have exactly 30 days and the chance of you being born in any particular month is j^- What is the probability that neither of two randomly selected people will have been born in the same month as you? 16) The standard California license plates made in 2011 or later must begin with a digit between 6 or 9 followed by a letter anywhere from T to Z. They then have any two letters followed by any three digits. How many of these license plates are possible? 2.6 Chapter 2 Review Probability in its most basic form is a simple question of how likely it is for an outcome to occur. We are always looking to divide the number of favorable outcomes by the total number of outcomes. We cannot 67 www.ckl2.org predict a specific outcome for a random event but the Law of Large Numbers allows us to make long term predictions of chance behavior. There are numerous rules dictating how we must calculate probabilities. These rules often deal with whether events are independent, outcomes are mutually exclusive, or whether replacement is used. We also must deal with conditional probabilities for situations in which a particular outcome is assumed to have occurred. To help us organize each situation, we can often utilize Venn Diagrams, tree diagrams, and contingency or 2-way tables. Having a clear understanding of situations and being organized when dealing with probability is critical when performing probability calculations. Chapter 2 Review Exercises 1) Suppose that 57 of 110 students at a school are underclassmen (freshmen or sophomores) while the rest of the students are upperclassmen (juniors or seniors). Suppose three students are selected at random. a) What is the probability that all three of the students are underclassmen? b) What is the probability that all three students are upperclassmen? c) What is the probability that there is at least one underclassman and at least one upperclass- man in the group of three students? 2) Two 6-sided dice are rolled, one after the other. Find each probability. a) P(total of 10 or more) b) P (doubles) c) P (total is even or less than 6) d) P(an odd product) e) P (first die is greater than second die) f) P (a 6 or a three is showing on at least one die) g) P(an odd total or a 2 is showing) 3) A pet store surveys his customers during the day and finds that 15 customers own dogs and 9 own cats. Included in these were 4 customers who owned both. a) Draw a Venn Diagram for this situation b) How many total customers were surveyed? c) Suppose one of these customers was selected at random. What is P (owned a dog)? d) Suppose one of these customers was selected at random. What is P(own a dogjown a cat)? e) Suppose one of these customers was selected at random. What is P(own a catjown a dog)? 3) Suppose that 40% of all adults in a certain town are females and that 60% are males. In addition, 60% of the females hold full-time jobs while 80% of the males hold full-time jobs. www.ckl2.org 68 a) Draw and label a tree diagram to represent this situation. b) What is the chance that a randomly selected person holds a full-time job? 4) For Halloween at my house, kids spin a spinner that has three equally marked spaces labeled 1, 2, and 3. The number they spin is the number of pieces of candy they get. In my bag, I start with 20 chocolate bars and 30 sugar bombs - all with identical packaging. Trick-or-Treaters pick randomly out of my bag after they spin. I only restock my candy bag after each child finishes picking all their candy. a) What is the chance that a trick-or-treater gets to pick three pieces of candy? b) Suppose a trick-or-treater spins a three. What is the chance that they pick three sugar bombs? c) Suppose a trick-or-treater spins a three. What is the chance that they pick three chocolate bars? d) What is the chance that the trick-or-treater gets only one chocolate bar and nothing else? e) What is the chance that the trick-or-treater gets exactly one chocolate bar and one sugar bomb? 5) Suppose the table below gives a breakdown of the ages and genders of the teachers at your school. Table 2.6: <29 30-39 40-49 >50 Male 5 6 18 7 Female 7 7 13 4 69 www.ckl2.org GM Cars 14 Trucks 8 Vans 2 Find the probability that a randomly selected teacher is: a) a male. b) 39 years old or younger. c) either a male or at least 50 years old. d) from 30 to 39 years old given that they are a female. e) a female given that they are at least 40 years old. 6) Consider the 2- way table shown below that shows the number of different types of automobiles produced by major manufacturers. Table 2.7: Ford Chrysler Toyota 11 12 7 9 5 6 3 5 3 What is the probability that a randomly selected vehicle is: a) a Ford? b) a truck? c) a van or a Toyota? d) a car given that the vehicle is built by GM? e) a Ford given that the vehicle is a truck? 7) Two cards are dealt from a standard 52 card deck without replacement. What is the probability that the two cards are both face cards? 8) A baseball player has a batting average of .250 which means that he averages one hit for every four times he comes to the plate. What is the probability that this player will end up with exactly 2 hits if he comes to the plate 3 times in a single game? 9) Two bags have an assortment of marbles in them. The first bag contains 11 black, 12 white, and 7 gold marbles. The second bag contains 9 black and 11 white marbles. One marble is randomly selected out of each bag. www.ckl2.org TO a) Draw a tree diagram to represent this situation. b) What is the probability that the two marbles are both black? c) What is the probability that the two marbles are the same color? 10) A special deck of cards contains only the eight red cards that are face cards or aces. Two cards are dealt off the top of the deck. a) What is the probability that the two cards you end up with are both kings? b) What is the probability that the two cards are of different value? c)What is the probability that the two cards have the same value (two kings, two queens, etc...)? d) What is the probability that the two cards are the same suit? 11) A bag contains ten red cubes numbered 1 through 10 and five green cubes numbered 1 through 5. Two cubes are pulled from the bag at random. What is the probability that the two cubes are: a) both red? b) both odd? c) the same color? d) the same value? 12) For a carnival game, a bag contains one $100 bills and nine $20 bills. You roll a single 6-sided die one time. If you roll a one or two you get to pull one bill out of the bag. If you roll a three, four, five, or six, you get to pull two bills out of the bag. a) Draw a tree diagram for this situation. b) Build a probability model for this situation. b) What is the probability that you win exactly $120? 71 www.ckl2.org 13) A burglar alarm system has three separate detection mechanisms it uses to detect an intruder. Suppose a skilled burglar has an 30% chance to get around the first part of the detection system, a 60% chance of getting around the second part of the system, and a 55% chance of getting around the third part of the system. Assume each part of the detection system is independent of the other parts of the system. a) What is the chance that the system does not detect the burglar? b) Based upon your answer to part a), what must be the chance that the system does detect the burglar? c) What is the chance that the burglar can get around exactly two of the three detection systems 14) On a basketball team, players can play at least one of three positions; guard, forward, or center. Suppose that 30 girls try out for the basketball team. During tryouts 13 girls indicate they can play guard only, 3 state they can play center only, 6 state they can play center or forward and the rest state they can play forward only. A player is selected at random. a) Draw a Venn diagram for this situation. b) What is the probability that the randomly selected player says they can play forward? c) Given that the player indicates they can play forward, what is the probability they can also play center? 15) A girl is deciding what jewelry to wear as she gets ready for school. She has 5 bracelets, 6 rings, and 8 necklaces from which to choose. a) In how many ways can she choose exactly one of each item to wear from the 19 available items? www.ckl2.org 72 b) If she decides to randomly select three pieces of jewelry, what is the probability that all three of the items she picks are exacty the same type of jewelry? c) What is the probability that she picks exactly one bracelet, one ring, and one necklace if she randomly selects three pieces of jewelry? 16) Your statistics teacher needs to select 3 students to help demonstrate an activity. Your class has 12 sophomores, 19 juniors, and 5 seniors in it. Your teacher makes a random selection of three students. a) In how many ways can your teacher select three students from this class of 36 students? b) What is the probability that all three students will be juniors? c) What is the probability that exactly one student from each grade will be selected? 17) All football plays that an offense can run can be classified as a pass, run, or a kick. No play can ever be put into two categories. a) If an offense completes two plays, will these two plays be independent of each other? Why or why not? b) If the offense runs one play, are the possible outcomes (pass, run, or kick) mutually exclusive? Why or why not? Image References Coins http://coinauctionshelp.com Pool Balls http://plutonius.aibrean.com Scattegories Die http://ehow.com October Calendar http://printablecalendars.resources2u.com Roulette Wheel http://www.partypokersupplies.co.uk Kids at Board http://teachers.greenville.kl2.sc.us Two Striped Pool Balls http://demo.physics.uiuc.edu Ace and King of Spades http://www.123rf.com Seat Belt http://sawmengzhi.blogspot.com Graduation http://www.prlog.org Aerosmith http://www.obit-mag.com Cabinet http://www.renovation-headquarters.com Cathedral in St. Paul, MN http://www.scenicreflections.com Tree http://www.onenewsnow.com $100 Bill http://onlinecurrencytradingfxcm.blogspot.com Royal Flush http://www.artpoker.net 73 www.ckl2.org Straight http : / /www.findabet .co.uk Turtle http://www.maine.gov Trick or Treaters http://www.myrenioteradio.com Bag of Marbles http://www.worldwiseimports.com Burglar http://www.emovingstorage.com www.ckl2.org 74 Chapter 3 Expected Values & Simulation 3.1 Probability Models & Expected Value (Note- Three Separate Videos for the Section) ***** ***** ***** UN m Learning Objectives Be able to construct a probability model (expected value table) given all possible outcomes and the associated probabilities Be able to calculate the expected value for a situation given a probability model Be able to calculate missing values in a probability model given information about the expected value in a situation. Suppose you walk into a casino. You will see all sorts of games varying from blackjack and poker to slot machines. It would not take long for you to notice that there are some players who are winning some money, sometimes a substantial amount. You might wonder how the casino makes money when they are clearly giving some money away. 75 www.ckl2.org Costa Garnet gst* Casinos have a clear understanding of expected value. The expected value for a situation can be thought of as the average result over the long run. In other words, it can be thought of the expected winnings or average payout for a game of chance. Consider the thinking of the owner of a casino. While there are some people who win a little, and occasionally a few people who win a lot, most people end up losing some money at the casino. The casino actually expects some people to occasionally win big. In fact it makes for great advertisement! As long as the mathematics show that the expected value is in the casino's favor, the casino will continue to make money in the long run. In this section, we will focus on how to calculate the expected value. As mentioned above, the expected value can be thought of as the average result over the long run. Recall that to find the average value of a series of numbers, we simply add up the numbers and divide by how ever many numbers there are. For example, the average of 3, 4, 5, and 6 is 4.5 because + j + = 4.5. You will notice that the average value of 4.5 is not one of the numbers in the original set of numbers. This is often also true with expected values. The expected value for a situation does not have to be one of the possible values. Use the concept of averages to find the expected value for the example below. Example 1 A game is played in which a coin is flipped one time. If the coin lands on heads, the player wins $5. If the coin lands on tails, the player wins $10. What is the expected value for a player who plays this game one time? Solution The expected value is $7.50. This is strange because it is actually impossible for a player to win $7.50. They could only win either $5 or $10 but the average win will be $7.50. One way to see this is to actually play the game two times. If the flips come out matching their theoretical probabilities, one of the flips will be heads for $5 and the other will be tails for $10. The player will have won $15 in two games so the average win or expected value would be $5+$10 ^ = $7.50. This method works quite well in simple situations, but it gets more cumbersome as the situations get more complex. Consider the example below. www.ckl2.org 76 Example 2 Student council is raising money to support a program called "Shoes for the Homeless". A booth was set up in the lunchroom at which students could pledge a donation of $1, $5, or $10 for money towards a large shoe purchase. 125 students pledged money for this fundraiser. Eighty students pledged $1, 25 students pledged $5, and 20 students pledged $10. a) Build a probability model for this situation. b) What was the average donation per student? Solution a) Remember that a probability model needs probabilities, not just counts. For $1 pledges we have y|jj = 0.64, for $5 pledges we have ^ = 0.2, and for $10 pledges we have ^ = 0.16. Notice that 0.64+0.2+0.16=1 or 100%. Value $1 $5 $10 Prob. 0.64 0.2 0.16 b) There were 80 students who pledged $1 each for a total of $80, there were 25 students who pledged $5 each for a total of $125, and there were 20 students who pledged $10 each for a total of $200. All the pledges added together give us $80 + $125 + $200 = $405. We now divide to get KM*- = $3.24 per student. The average donation per student was $3.24. You may have noticed that the values in the probability model in Example 2 can be used to find the average donation as well. AverageDonation = ($1) (0.64) + ($5) (0.2) + ($10) (0.16) = $3.24. Simply multiply the amount of the donation by the probability of that donation for each amount and add those results together. This leads us to our expected value formula which is given in Figure 3.1 below. EV = (Value l)(Prob. l)+( Value 2) (Prob. 2)+ (Value 3) (Prob. 3)+... Figure 3.1 Example 3 What is the expected value for the total of a roll for two 6-sided dice? Solution We will address this two ways. The first method will be done by using averaging and the second method will be done by using the expected value formula. 77 www.ckl2.org Begin by building the sample space for the sum of two dice. As in section 1.1, we get the dice chart shown below. Notice that there are exactly 36 equally likely spaces on the grid. So instead of playing just one time, suppose we play 36 times. If everything matches the theoretical probabilities, each of these outcomes would happen exactly one time. Add the values for each of the 36 spaces and divide by 36. For simplicity, we will add diagonally to get -A- = 7. The expected value is 7 which means that 7 is 2+3+3+4+4+4H -10+10+10+11+11+12 36 36 the average value of a roll of two 6-sided dice. ♦ 1 2 3 4 5 6 1 2 3 4 5 6 7 2 3 4 5 6 7 3 3 4 5 6 7 3 9 4 5 & 7 3 9 10 5 6 7 S 9 10 11 6 7 3 9 10 11 12 Let's now do this problem by building a probability model and using the expected value formula, EV = (value 1) (prob. 1) + (value 2) (prob. 2) + (value 3) (prob. 3) + ... The probability model for this situation is given below. Does this make sense and did you verify that the total of the probabilities in the table add up to 1? Value 2 3 4 5 6 7 8 9 10 11 12 Prob. 1 36 2 36 3 36 4 36 5 36 6 36 5 36 4 36 3 36 2 36 1 36 Using our expected value formula we have EV = (2) (i) + (3) (£,) + (4) (|) + (5) (A) + ( 6 ) (A) + (7 ) (J.) + (8 ) (£) + (9 ) ( * ) + (10 ) (|) + (11) (it) + (12) I Jgj = 7. The expected value for the total when two dice are rolled is 7. Example 4 A carnival game is being played that has several prizes that a player can win. Suppose a probability model is put together for this game as shown below. a) Find the missing value. b) Calculate the expected value and explain what it means. Table 3.1: Probability Model for Carnival Game Value Probability $30 0.01 $20 0.03 $10 ??? $1 0.9 www.ckl2.org 78 Solution a) The probabilities in a probability model must add up to 1. We recognize that 0. 01+0. 03+???+0. 9=1 must be true. The missing value must be 0.06. b) EV = ($30) (0.01) + ($20) (0.03) + ($10) (0.06) + ($1) (0.9) = $0.30 + $0.60 + $0.60 + $0.90 = $2.40. Our expected value is $2.40. In other words, if this game were played many times, the average payout would be $2.40. Note that the expected value of $2.40 is not a possible prize that a player can win. Example 5 Suppose a casino game has an expected payout of $1 every time it is played. A player is paid nothing 45% of the time, they are paid one dollar 35% of the time, and they are paid three dollars 15% of the time. There is one more payout amount in this game. a) Build a probability model for this situation. Be sure to calculate the percent of time the remaining payout occurred. b) How much should this payout be so that the expected value is $1? Solution a) Start by noticing we have used 45%+35%+15%=95% of all outcomes. This means that the remaining outcome must be 5%. This allows us to build a probability model that is mostly complete. Table 3.2: Probability Model Amount $0 $1 $3 ??? Probability 0.45 0.35 0.15 0.05 Our calculation is now based upon the expected value formula. Use 'x' to represent the missing amount. ($0) (0.45) + ($1) (0.35) + ($3) (0.15) + (x) (0.05) = $1. $0 + $0.35 + $0.45 + (x) (0.05) = $1 $0.80 + (x) (0.05) = $1. 79 www.ckl2.org (x) (0.05) = $0.20 x=$4. The missing value is $4. Example 6 A carnival game has prizes and probabilities as shown in the table below. How much should the game cost if the owner of the game wants to average a $2 profit per player? Value S3 $5 $20 Prob. 0.65 0.30 0.05 Solution First calculate the expected value to get EV = $3 x 0.65 + $5 x 0.30 + $20 x 0.05 = $4.45. This means that the average player will be paid $4.45 when they play. Therefore, the owner should charge $2 more than this or $6.45. Problem Set 3.1 Exercises 1) The table below represents the number of vehicles and the associated probability of having that number of vehicles in an individual household. What is the expected number of vehicles in a typical household? Table 3.3: Vehicle Ownership # Owned Probability 0.02 1 0.26 2 0.37 3 0.19 4 0.12 5 0.04 2) A student sells products as part of a fundraiser to raise money for a choir trip to New York. She sold 75 items total which included 50 rolls of cookie dough for $6 each, 15 packages of butter braids at $10 each, and 10 bake-at-home bread packs for $12 each. a) Find the percent of her sales for each item. b) Build a probability model for this situation. c) Find the expected value of a sale for this particular student. 3) The owner of Friendly's Casino decides that she will set up her payouts in their 'Fast Cash' game so that the average gambler neither wins nor loses money. For a gambler who plays this game, the chance of www.ckl2.org 80 getting paid nothing is 30%, the chance of getting paid $5 is 40%, the chance of getting paid $10 is 25%, and the chance of getting paid $30 is 5%. How much will the owner of Friendly 's charge for this game? to JEh4mC(m4- LAS VEGAS 4) The owner of Greedy's Casino decides he wants to make an average of $1.50 every time a gambler plays the game called 'Funny Money'. The chance of getting paid $2 is 20%, the chance of getting paid $5 is 40%, the chance of getting paid $10 is 30%, and the chance of getting paid $15 is 10%. What should the owner of Greedy's charge to play this game? 5) In a certain racing video game, players try to go around a track as many times as possible. If a racer completes a lap in time, they continue on to the next lap. If they don't complete a lap in time, their race is complete at the end of the lap they are currently finishing. The probability model below gives the probabilities of the maximum number of laps completed by people who play the video game. What is the expected number of laps completed for each racer? Table 3.4: Laps Completed During Racing # of Laps Probability 1 0.29 2 0.38 3 0.17 4 0.11 5 0.05 81 www.ckl2.org 6) In a certain casino game, the average payout (expected value) for a player is $2.53. A partially completed probability model for this game is given below. Table 3.5: Casino Game Payouts Amount Paid $0 $1 $3 ??? $21 Probability 0.32 0.47 0.08 0.07 0.06 a) What is the missing amount? b) If the casino was going to set a price for this game, do you think they would choose to charge $2 to play or $3 to play? Explain your choice. c) If the casino was going to set a price for this game, do you think they would choose $3 to play or $6 to play? Explain your choice. 7) What is the average roll for a single 6-sided die? 8) A coin is flipped one time. If it lands on heads, you win $20. If it lands on tails, you win $30. Build a probability model and calculate the expected value for this game by using the expected value formula. 9) Two students are given the partially completed probability model below as part of a project. The teacher tells them that the expected value for this situation is $6.95. Table 3.6: Probability Model Given to Students Value $3 $6 $10 $50 Probability 0.25 0.35 ??? 0.07 a) Assuming that the expected value of $6.95 is correct, what should be the value of the missing probability? Explain why this is impossible. b) Assuming the expected value of $6.95 is incorrect, what should be the value of the missing probability? c) The given expected value of $6.95 was incorrect. What is the correct expected value? www.ckl2.org 82 10) I want to come up with a game that has 5 prizes. There will be a 20% chance of getting paid Si, a 25% chance of getting paid $3, a 15% chance of getting paid $4, and a 30% chance of getting paid $7. a) What is the probability of winning the 5th prize? b) What is the value of the 5th prize if the expected payout for the game is $4.75? 11) For Halloween next year, I have decided that I will distribute an average of 1.6 pieces of candy per child who comes to my door. To help me do this, I have set up a game of chance whereby each trick-or-treater gets to play a game that determines how many pieces of candy they get to pick from my bag. I started building a probability model that shows the probabilities of being able to select 0, 1, 2, 3, or 4 pieces of candy. Unfortunately, I did not have time to finish my table. Use what I have so far to answer the questions. Table 3.7: Halloween Candy Distribution # of Pieces 12 3 4 Probability 0.03 0.45 x y 0.02 a) Give the expected value equation using the variables x and y and the expected value of 1.6. Simplify your equation by combining like terms. b) Give an equation using the variables x and y that uses the fact that the probabilities in a probability model must add up to one. Clean up your equation by combining like terms. c) Using your answers from parts a) and b), write a system of equations and solve for the variables x and y. Review Exercises 12) A sample of 325 students were asked which electronic device they use most frequently, their cell phone, a computer (including wireless devices) , or a television (including video games) . The gender of the student was also recorded and the results are shown in the two-way table below. 83 www.ckl2.org Cell Phone Computer Television Total Male 60 30 55 145 Female 115 45 20 180 Total 175 75 75 325 a) What is the probability that a randomly selected student was a male? b) What is the probability that a randomly selected student was most likely to say they used a cell phone most frequently? c) What is the probability that a student was male given that they indicated they used the television most frequently? d) What is the probability that a student indicated that they used a computer most frequently given that they were a female? 13) One floor of an office building is being remodeled and redecorated and an employee is responsible for picking out three different styles of chair and two different styles of table for their office furniture. Suppose the furniture store has 10 different chair styles and 4 different table styles that would be appropriate for office furniture. In how many different ways can the employee select the three chair styles and two table styles? 14) Consider a standard deck of 52 cards. Suppose 1 card is drawn randomly from the deck. Find each probability. a) P(Red Card) b) P(Spade) www.ckl2.org 84 c) P(Face) d) P(Heart|Red) 3.2 Applied Expected Value Calculations (Note- Three Separate Videos for the Section) *** *m*** * ****** ***** ***** ***" n ft ft Learning Objectives • Understand the concept of a fair game • Be able to analyze a game of chance by building a probability model and calculating expected values from scratch Casinos have a very delicate balancing act they must manage. First of all, they want to make money. However, people just don't like to lose money. In order to make money, the casino games have to be in favor of the house and not the player. Why don't casinos tilt the games even more to the house's favor? If they did, their expected value would certainly go up. On the other hand, attendance at the casino would go down. No matter what the odds, a casino can't make money unless they can keep people coming through the doors. Setting the games up so that there are still winners, some occasionally big, is good for attendance. You might even know someone who has made a large amount of money at a casino. In this section, we will bring together our ideas about calculating probabilities from Ch. 2 along with the concept of expected value in order to be able to analyze a game of chance. We begin where we left off in section 3.1. A fair game is a game in which neither the player nor the house has an advantage. In other words, when all is said and done, the average player will not have made or lost any money whatsoever. 85 www.ckl2.org Example 1 A bag has 10 red marbles and 8 blue marbles in it. A player reaches into the bag pulling out 2 marbles, one after the other without replacement. If the color of the two marbles match, the player wins $10. If they don't match, the player wins nothing. The game costs $5 to play. a) Use a tree diagram to help find the probability that the two marbles match. Use your result to build a probability model. b) Is this game a fair game? If so, explain why. If not, give the value that the game should cost in order to be fair. Solution P(Red)= P(Blue) P(Red)= — 17 90 306 80 306 80 306 56 306 Table 3.8: Marble Distribution Result Value Probability Red, Red $10 90 306 Red, Blue $0 80 306 Blue, Red $0 80 306 Blue, Blue $10 56 306 www.ckl2.org 86 EV = ($10) (H) + ($0) (H) + ($0) (H) + ($10) (H ^ $1,460 306 $4.77 The expected value is $4.77. Notice, however, that $4.77 is the expected amount that the house pays out each game. The expected value for the house is $0.23 because every player must pay $5 to play. At $5, the game is not fair because it favors the house by an average of 23 cents every time the game is played. To be a fair game, it should cost $4.77. The game of GREED is a game of chance in which players try to decide when they have accumulated enough points on a turn to stop. Two 6-sided dice are rolled. The player gets to keep the total that shows on the two dice. After every roll, the player can either decide to roll again and try to add to their current total for that turn or stop and put their points in the bank. The only catch in this game is that if a total of 5 is rolled, all points accumulated on that turn are lost. For example, suppose the first roll has a total of 9 and the player decides to go again. The next roll has a total of 7. The player now has 16 points accumulated on this turn and must decide to either put those 16 points in the bank or risk them. If they decide to risk the 16 points and a total of 5 comes up next, the score for that turn will be 0. Example 2 Suppose a person is playing GREED and has accumulated 26 points so far. Is it to their advantage to roll one more time? Solution We will build a probability model and calculate the expected value based upon what might happen with one more roll. (See Example 3 from Section 3.1.) For example, there is a gg chance that the total will be 2. This would mean the player would have a total of 28 points with one more roll. The highest a player could have after this turn would be 38 points if they happen to roll a total of 12. The risk is that the player will roll a total of 5 and lose their 26 points. Value 28 29 30 32 33 34 35 36 37 38 Prob. 1 36 2 36 3 36 4 36 5 36 6 36 5 36 4 36 3 36 2 36 1 36 The one item to be careful about here is that it is impossible to get a total of 31 in the chart. Remember, if you roll a total of 5, you lose all of your points. When we perform our expected value computation for the probability model above, we get approximately 29.6. In other words, if we roll exactly one more time, our average result will be almost 30 points. This is definitely better than stopping with 26 points. It is to the advantage of the player to roll again. Example 3 An investor is going to make a long-term investment in a company. If all goes well, an investment of $100 will be worth $900 in twenty years. The risk is that the company may go bankrupt within twenty years in which case the investment is worthless. Suppose there is a 25% chance that the company will go bankrupt within 20 years. What is the expected value of this investment? 87 www.ckl2.org Solution Start by building a probability model as shown below that shows that there is a 25% chance of making nothing and a 75% chance of making $900. Value $0 $900 Prob. 0.25 0.75 EV = ($0) (0.25) + ($900) (0.75) = $675. Taking into account that this investment cost $100, the investor should get an average profit of $575. Problem Set 3.2 Exercises 1) In the carnival game Wiffle Roll, a player will roll a wiffle ball across some colored cups. Suppose that if the ball stops in a blue cup, the player wins $20. If it stops in a red cup, they win $10, and if it stops in a white cup, the player wins nothing. There are 25 white cups, 4 red cups, and 1 blue cup. Assume the chances of stopping in any cup is the same. How much should this game cost if it is to be a fair game? 2) In a simple game, you roll a single 6-sided die one time. The amount you are paid is the same as the amount rolled. For example, if you roll a one, you get paid $1. If you roll a two you get paid $2 and so on. The only exception to this is if you roll a 6 in which case you get paid $12. What should this game cost in order to be a fair game? 3) In the Minnesota Daily 3 lottery, players are given a lottery ticket based upon 3 digits that they pick. If their 3 digits match the winning digits in the correct order, then the player wins $500. If the digits don't match, then the player loses. The game costs $1 to play. What is the expected value for a player of this lottery game. 4) Suppose you are playing the game of GREED as described in Example 2. You have accumulated a total of 55 points on one turn so far. Is it to your advantage to roll one more time? 5) Suppose you are playing the game of GREED again. This time you have accumulated a total of 60 points in one turn so far. Is it to your advantage to roll one more time? 6) Using your results from numbers 4) and 5) and a little more investigation, for what number of points in a turn in the game of GREED does it make no difference if you roll one more time or stop? In other words, at what point total does the expected value with one more roll give the same total as if you had stopped? www.ckl2.org 88 7) A bucket contains 12 blue, 10 red, and 8 yellow marbles. For $5, a player is allowed to randomly pick two marbles out of the bucket without replacement. If the colors of the two marbles match each other, the player wins $12. Otherwise the player wins nothing. What is the expected gain or loss for the player? 8) An insurance company insures an antique stamp collection worth $20,000 for an annual premium of $300. The insurance company collects $300 every year but only pays out the $20,000 if the collection is lost, damaged or stolen. Suppose the insurance company assesses the chance of the stamp collection being lost, stolen, or destroyed at 0.002. What is the expected annual profit for the insurance company? 9) A prospector purchases a parcel of land for $50,000 hoping that it contains significant amounts of natural gas. Based upon other parcels of land in the same area, there is a 20% chance that the land will be highly productive, a 70% chance that it will be somewhat productive, and a 10% chance that it will be completely unproductive. If it is determined that the land will be highly productive, the prospector will be able to sell the land for $130,000. If it is determined that the land is moderately productive, the prospector will be able to sell the land for $90,000. However, if the land is determined to be completely unproductive, the prospector will not be able to sell the land. Based upon the idea of expected value, did the prospector make a good investment? 10) A woman who is 35 years old purchases a term life insurance policy for an annual premium of $360. Based upon US government statistics, the probability that the woman will survive the year is .999057. Find the expected profit for the insurance company for this particular policy if it pays $250,000 upon the woman's death. 11) A bucket contains 1 gold, 3 silver, and 16 red marbles. A player randomly pulls one marble out of this bag. If they pull a gold marble, they get to pick one bill at random out of a money bag containing a $100 bill, five $20 bills, and fourteen $5 bills. If they pull a silver marble out of the bag, they get to pick one bill at random out of a bag containing a $100 bill, two $20 bills, and seventeen $2 bills. If your marble is red, you automatically lose. The game costs $5 to play. a) Build a tree diagram for this situation. b) Build a probability model for this situation. c) Calculate the expected gain or loss for the player. 12) A spinner has four colors on it, red, blue, green, and yellow. Half of the spinner is red and the remaining half of the spinner is split evenly among the three remaining colors. A player pays some money to spin one time. If the spinner stops on red, the player receives $2. If it stops on blue, the player receives $4. If it stops on either green or yellow, the player wins $5. What should this game cost in order to be a fair game 7 89 www.ckl2.org 13) A bag contains 1 gold, 3 silver, and 6 red marbles. A second bag contains a $20 bill, three $10 bills, and six $1 bills. A player pulls out one marble from the first bag. If it is gold, they get to pick two bills from the money bag (without replacement). If it is a silver marble, they get to pick one bill from the money bag, and if the marble is red, they lose. The game costs $3 to play. Should you play? Explain why or why not. 14) Suppose the Minnesota Daily 3 lottery adds a new prize. You still get $500 if you match all three digits in order, but you can also win $80 if you have the three correct digits but not in the right order. The game still costs $1. What is the expected value for a player of this lottery game? Review Exercises 15) Consider the partially complete probability model given below. Value $2 $4 $7 $11 Prob. X .25 .15 .05 a) What is the value of 'X'? b) What is the expected value for this situation. 16) The student council is starting to prepare for prom and decides to name a committee of 6 members. Suppose that they decide the committee will have 2 juniors and 4 seniors on it. In how many ways can the committee be selected if there are 8 juniors and 8 seniors from which to select? www.ckl2.org 90 17) Three cards are dealt off the top of a well-shuffled standard deck of cards. What is the probability that all three cards will be the same color?": 18) A student does not have enough time to finish a multiple choice test so they must guess on the last two questions. List the sample space of the possible guesses for the last two questions if each question has only choices a, b, and c. 91 www.ckl2.org 3.3 Simulation and Experimental Probability (Note- Two Separate Videos for the Section) ***** ***** £3 t3 Learning Objectives • Understand how to generate random numbers using a random digit table or a calculator • Be able to properly assign digits to simulate a random situation • Be able to interpret results from a simulation and understand the connection to the Law of Large Numbers For many of the problems we have addressed, putting together a theoretical model is very reasonable for us to do. A theoretical model gives a picture of what should happen in the long run for any situation involving probability. It will give us a very clear idea of what to expect out of a particular situation. If you have been dealt an ace, you can quickly figure out the probability that the next card will be a face card. That probability is a theoretical probability. However, the truth is that in many situations it is beyond the scope of the mathematics of this course to calculate theoretical probabilities. In these situations, we can estimate probabilities by performing a simulation through the use of an experimental model. Some of our simulations can be done quite easily using actual probability tools like dice or spinners. Some situations, though, will require us to use a random number generator or a table of random digits. Be sure you can find the random number generator on your calculator. You can see a table of random digits at the end of this book in Appendix A, Part 1. A table of random digits contains a random mix of digits from through 9. These digits can be used to simulate virtually any situation involving chance behavior from rolling dice to drawing cards to modeling other known situations. It is important that you are detailed in your explanation of how you will assign digits so that others may model your simulation procedure exactly. The 6 steps for using a table of random digits for a simulation are shown below. 1) Assign the same number of digits to each of the different possible outcomes. 2) Choose a line number to use from the random digit table. Sometimes the line number is given to you. 3) State how many digits you will select at a time. If your largest value is less than 10, you will be able to select one digit at a time. If it is less than 100 you will have to use two digits at a time. If it is less than 1000, you will have to use 3 digits at a time and so on. 4) State what values you will ignore. These typically are values that are larger than your biggest value and number combinations like 000. 5) Know when to stop. Pay attention to the number of trials you must complete. www.ckl2.org 92 6) Select your digits and summarize your results. Example 1 Use the line of random digits below to randomly select three days from the month of October. 3547655972394216585004266354354374211937 Solution October has 31 days in it so we must identify 31 different outcomes. Our largest value is 31 so I will have to select two digits at a time. I will ignore 32 through 99 and 00. Assign two digits per day, so for example, 01 = the first of October. 3^4-^^55-97-23-94-2165-85-004266354354374211937 Notice that we crossed out 35, 47, etc... because these were all beyond our largest value of 31. Our three days are the 16th, 4th, and 26th of October. Example 2 Suppose you wish to roll two dice a total of 5 times and keep track of the totals. You don't have any dice, but you do have access to the line of random digits below. Explain how you could simulate the rolls of two dice using the random digits and then perform the simulation. 19223 95034 05756 28713 96409 12531 42544 82813 Solution We will have to roll two dice. Each die will have a value from 1 through 6 on each roll. We will ignore digits 7, 8, 9, and as a die can never come up with those values. The first three digits in the line of random digits are 1,9,2. The first die will be a 1. We ignore the 9. The second die will be a 2. This gives a total of 3. Our second roll picks up right where we left off and we get a 2 and a 3 for a total of 5. Using the same procedure, our next three totals will be 8, 9, and 11. Our five results were 3, 5, 8, 9, and 11. Remember that it is unwise to make assumptions after only a very small set of rolls. For example, it would be incorrect to say that a total of 7 is unlikely to happen since it did not come up on our simulation. We only simulated the rolls 5 times which is not nearly enough to make a conclusion. The Law of Large Numbers states that as we increase the number of trials we should get closer and closer to the theoretical probability. Theoretically, there is a i chance that the total is 7. If we did our simulation for thousands of rolls, we would expect that our simulation would show that a total of 7 comes up about - of the time. Example 3 At the start of this season, Major League Baseball fans were asked which American League Central team would be most likely to win the division this year. The table below gives the results of the poll. Table 3.9: Most Likely to Win AL Central Team Chicago Cleveland Detroit Kansas City Minnesota Probability 0.14 0.23 0.33 0.02 0.28 93 www.ckl2.org Using the line of random digits supplied, simulate the results when asking 10 fans who they think will win the AL Central. 73676 47150 99400 01927 27754 42648 82425 36290 Solution Notice that the probability adds up to 100%. Selecting two digits at a time works well in situations like this. Since 14% or 14 out of 100 fans support Chicago, assign 01-14 as Chicago fans. Assign 15-37 as fans who think Cleveland will win. A good trick to remember is to add the 23% for Cleveland to the ending percent for Chicago which was 24%. This will give you the end of the interval for Cleveland. Likewise, Detroit will be 38-70, Kansas City will be 71-72, and Minnesota will be 73-99 and 00. Notice that Minnesota can't use 100 as that is three digits and we are only selecting two at a time. The digit combination '00' can be used to represent 100. We now select our 10 fans. Our 2-digit pairs are 73, 67, 64, 71, 50, 99, 40, 00, 19, and 27. The table below summarizes our results. Team Chicago Cleveland Detroit Kansas City Minnesota Values 01-14 15-37 38-79 71-72 73-99, 00 S of Fans □ 2 4 1 3 In our simulation, we found that 4 out of our 10 randomly selected fans felt Detroit was going to win. While it is not exactly the 33% we were given in the original problem, it is fairly close. Once again, if we had done hundreds of trials instead of just 10, our percentages would tend to get very close to the theoretical probability according to the Law of Large Numbers. www.ckl2.org 94 Example 4 Every person is born on a different day of the month. Some people are born on the 1st and some people are born as late as the 31st. How many people must you go through until you find two that were born on the same day of the month? Simulate this one time using the random digits below. (Ignore the fact that people are not equally likely to be born on all days. It is more likely you were born on the 17th than the 31st since all months have a 17th but not all months have a 31st.) 45467 71709 77558 00095 32863 29485 82226 90056 52711 38889 93074 60227 40011 85848 48767 52273 Solution We will select 2 digits at a time as our largest value, 31, requires two digits. We will use 01, 02, ... 30, 31 and ignore 32-99 and 00. The numbers we get are 45, 46, 77, 17, 09, 77, 55, 80, 00, 95, 32, 86, 32, 94, 85, 82, 22, 69, 00, 56, 52, 71, 13, 88, 89, 93, 07, 46, 02, 27, 40, 01, 18, 58, 48, 48, 76, 75, 22, and 73. The only 'keepers' are 17, 09, 22, 13, 07, 02, 27, 01, 18, and 22. We did not get a match until we got our second 22. It took us 10 people to find a pair that were born on the same day of the month. Notice that when you get to the end of one line in a random digit table, you simple continue by moving to the next line below. Problem Set 3.3 Exercises 1) Suppose that 80% of a school's student population is in favor of eliminating final exams. a) Explain how could you assign digits from a random digit table to simulate this situation? b) Suppose you ask 10 students if they would like to eliminate final exams. Simulate a random selection of 20 students and record how many of the 20 are in support of eliminating final exams. Use line 147 from the random digit table in Appendix A, Part 1. 95 www.ckl2.org 2) Suppose that students at a particular college are asked about their class rank when they were in high school. The table below shows what they said. www.ckl2.org 96 Table 3.10: Class Rank Prob. Top 10% 0.2 Top 10% to 25% Top 25% to 50% Bottom 50% 0.4 0.3 ??? a) What must the probability be for the bottom 50%? b) Explain how you could assign digits to carry out a simulation for this situation. c) Using your set up, perform a simulation. Use 20 students in your simulation and record your results. Use line 103 from the random digit table. 3) Suppose the grades for students in your Stats & Prob. course were distributed as shown in the table below. Table 3.11: Grade Prob. A 0.20 B 0.29 C 0.35 D or F 0.16 a) Explain how you could assign digits to simulate the grades of randomly chosen students. b) Simulate the grades for 30 students. Use line 106 from the random digit table. Build a tally chart to track your results. c) How closely did your simulation match the actual distribution? 4) How many five card poker hands must you be dealt in order to get a hand with two cards that have matching values? (For example, the 7 of hearts and 7 of diamonds have matching values.) a) Explain how you will assign digits for this situation. b) Perform the simulation one time and state how many five-card hands it took for you to get your first hand with two cards that match. Use line 138 from the random digit table. 97 www.ckl2.org 5) There are some basic concepts that should be clearly understood about a random digit table. Answer the questions below. a) Is it possible to have four 6's next to each other in a random digit table? b) What percent of the digits in a random digit table are 9's? c) What should you do if you come to the end of a line of random digits and you still need more digits? 6) Suppose we have a class of 30 students and you are wondering what the chances are that there is at least one pair of students who have the same birthday. Assume that there are 365 days in a year. a) Explain how could you assign digits from a random digit table to simulate this situation? b) Perform this simulation one time and record whether or not there was a match in the class of 30 students. Use line 121 from the random digit table. Review Exercises 7) Suppose you are dealt two cards from a well-shuffled standard deck of 52 cards. What is the probability that your two cards are a king and an ace (in either order)? 8) Consider a set of 15 pool balls. Pool balls numbered 1-8 are solid and pool balls numbered 9-15 are striped. You pull two pool balls randomly out of a bag without replacement. a) What is the probability that your second pool ball will be solid if your first pool ball had an even number? b) What is P (Even (Striped)? 9) A survey of junior boys found that 97 are planning to participate in either a fall sport or in a winter sport or both. Use the Venn Diagram below to answer the questions. a) How many junior boys are planning on playing hockey or football (or both)? www.ckl2.org 98 b) How many junior boys who are planning on participating in a fall or winter sport are not represented in the Venn Diagram? c) What is P(Hockey|Football)? 3.4 Chapter 3 Review The expected value gives us the average result over the long term. We use expected value tables and the simple formula EV = (Valuel) (Probl) + (Value2) {Prob2) + ... to calculate the expected value. We can put everything together for a full probability analysis of a situation by using our probability calculations and other tools like a tree diagram. Casinos are cognizant of what the expected value is on any of their games and are confident, despite having to occasionally give away some substantial prizes, that their games will make them money in the long run. We cannot ever predict with certainty what is going to happen in a given situation, but we can always run a simulation to approximate what can happen. We will often use a random number generator or a table of random digits to help us run a simulation. Chapter 3 Review Exercises 1) Ten red marbles and 15 blue marbles are in a bag. A game is played by first paying $5 and then picking two marbles out of the bag without replacement. If both marbles are red you are paid $10. If both marbles are blue, you are paid $5. If the marbles don't match, you are paid nothing. Analyze this game and determine whether or not it is to your advantage to play 2) When two dice are rolled, you can get a total of anything between 2 and 12. a) Use the table of random digits in Appendix A, Part 1 to simulate rolling two dice 36 times. Begin on line 119. Make a chart displaying the different results that you get and how many times you get each result. b) How close was your simulation to the theoretical probability of what should happen in 36 rolls? 3) A bag contains a $100 bill and two $20 bills. A person plays a game in which a coin is flipped one time. If it is heads, then the player gets to pick two bills out of the bag. If it is tails, the player only gets to pick one bill out of the bag. How much should this game cost to play if it is to be a fair game? 99 www.ckl2.org 4) Suppose there are 38 kids in your Statistics and Probability class. Devise a system using a random digit table so that the teacher can randomly select 4 students to each do a problem on the board. Use line 137 from the random digit table to carry out your simulation and state the numbers of the four students who are selected. 5) A spinner with three equally sized spaces on it are labeled 1, 2, and 3. A bag contains a $1 bill, a $5 bill, and a $10 bill. A player gets paid the amount they pull out of the bag times the number that they spin. What should this game cost in order to be a fair game? 6) The table below shows the probabilities for how kids get to school in the morning. Table 3.12: How students get to school Method Bus Walk Car Other Probability 0.31 0.14 0.39 ??? a) What must the Other category have as a probability? b) Describe how you would assign digits from a random digit table to set up a simulation for selecting a student to find out how they got to school. c) Carry out your simulation for a total of 10 students and record your results. Use line 104 from the random digit table. 7) In a sharpshooting competition, competitors shoot at a total of 20 targets. The table below shows the probabilities associated with hitting a certain number of targets. Some shooters are perfect but even the poorest shooters still hit at least 15 targets. a) What is the most likely number of hits that a shooter will have? b) What is the expected number of hits that a shooter will have? Table 3.13: Shooting Accuracy out of 20 shots # of Hits 15 Probability 0.04 8) In a game of chance, players pick one card from a well-shuffled deck of 52 cards. If the card is red, they get paid $2. If the card is a spade they get paid $3. If the card is a face card, they get paid $5 and if the card is an ace they get paid $10. A player gets paid for all the categories they meet. For example, the King of Spades would be worth $8 because it is a spade and a face card. How much should this game cost in order to be a fair game? Image References Slot Machine http://www.gamedev.net Welcome to Las Vegas http://pilipon.wordpress.com www.ckl2.org 100 16 17 18 19 20 0.12 0.35 0.28 0.18 0.03 Race Cars http://www.thunderboltgames.com Electronic Devices http://www.topnews.in Poker Chips http://www.ppppoker.com Minnesota State Lottery http://www.mnlottery.com $100 Bills http://www.sciencebuzz.org MN Twins Logo www.twins.mlb.com Final Exams Yes http://www.york.org Pair of Jacks http://xdeal.com Dice http://goblin-stock.deviantart.com 101 www.ckl2.org Chapter 4 Data Collection Introduction What is data? Why collect data? How is data collected? Who cares anyway? • How many walleye are in Lake Mille Lacs? • Does aspirin prevent heart attacks? • What is the approval rating for the President? • How have the schools in Minnesota been doing to prepare students for success in college? • What percent of people would return money when given too much change? All of these questions (and infinitely many more) can be answered through statistics. Statisticians begin by posing a question. Then they plan a method for collecting information, called data, about that question. Next they collect the data and analyze it. The statisticians will 'look' at the data in the form of graphs or tables. They will 'analyze' the data with numerical statistics. Finally they will 'explain' what they have learned, what conclusions can be made, and what is still unknown, in a written or verbal report. 4.1 DATA Learning Objectives • Know the terminology of data collection, variables, and measurement • Understand how measurements are used in statistics • Distinguish between the various methods for data collection Data and Variables When a topic needs to be studied or a question needs to be answered, researchers often collect data in an effort to find the answer. Data is a collection of facts, measurements, or observations about a set of individuals (data is plural, the word datum refers to a single observation). There are a variety of ways to collect data in order to study topics of interest. Researchers can analyze and compare test scores for various Minnesota High Schools. Scientists can conduct an experiment to determine the effectiveness of www.ckl2.org 102 a new medication. Union leaders can conduct a census of every union member before deciding to strike. Or market researchers can survey a randomly selected sample of teenage girls to determine what qualities they look for when purchasing a new cell phone. When a topic is being studied, there are often several variables, or characteristics about the individuals, that the researchers are interested in. Each person, animal, or object being studies is one individual (or subject). The variables can be either categorical or numerical. A categorical variable (or qualitative variable) can be put into categories, like favorite colors, type of car, etc. Whereas a numerical variable (or quantitative variable) can be assigned a numerical value, such as heights, distances, temperatures, etc. Associated Press ,■ Kristin Murphy Example 1 Suppose 1845 teenage girls are to be surveyed by a cell phone company that wants to design a new cell phone that they can successfully market to females under 20 years old. The questionnaire will likely include questions related to: age, birth date, race, area code where they live, number of texts sent per month, amount of money willing to spend per month, services they want offered, features they want included, length of time they have had a cell phone, favorite colors, etc. All of these are variables, because they will vary from individual to individual. However, only some of these variables are numerical. Identify the individuals and the numerical variables. Solution Individuals: each girl who completes a questionnaire Numerical Variables are: age, number of texts per month, amount of money willing to spend per month, length of time they have had a cell phone. When determining which variables are numerical, it might help to decide whether or not it would be appropriate to calculate an average, or the range for the reported data. Age is numerical, because we can certainly report an average age of those surveyed. Even though birth date and area code may be reported as numbers, it would make no sense to report an 'average birth date' or 'mean area code'. Numbers such as these, social security numbers, or student ID numbers divide the data into a bunch of categories of one item each. They are simply used for identification and are considered categorical variables. 103 www.ckl2.org Measurement in Statistics When a topic is to be studied the researchers decide what it is they want to know about each individual. These variables of interest can be measured using different instruments and need to be reported in specific units. The instrument is the tool used to make the measurement. This instrument could be something obvious like a scale, tape measure, thermometer, or speedometer. But, it could also be a something like a questionnaire, a rubric, or an exam. The units explain what the numbers represent, and might be feet, points, pounds, degrees Celsius, miles per hour, etc. Keep in mind that data is useless unless it is in context. For example, the number 12 could mean anything. Is it $12, or 12 inches, or 12 (in $1,000), or 12 apple pies? Without knowing the units, all you have is a meaningless list of numbers. Validity, Reliability and Bias The way in which any given variable is to be measured needs to be valid and reliable. Validity refers to the appropriateness of the instrument and units used. Reliability means that the instrument can be depended upon to consistently give the same measurement (or nearly the same). If an instrument gives different results when measuring the same thing, it is not reliable, and it has a lot of variability (because the results vary a lot). Another potential problem with measurements is bias. When a measurement is repeatedly too high or too low, it is said to be biased. In other words, a biased measurement is 'consistently wrong in the same direction'. Researchers would like to limit bias in measurements as much as possible. Ideally, we hope for measurements that are valid, low in bias, and highly reliable. No measurement is perfect or necessarily accurate. Averaging repeated measurements can be a way to limit variability. Be aware though, averaging will only reduce variability (or increase reliability). Averages will not make an in invalid measurement suddenly valid. And, the average of biased measurements will still be biased. For example, if the variable being studied is the weight of all of the members of the school wrestling team, then using a scale as the instrument and pounds as the units will be valid. And, as long as the scale being used is in working order, then the measurements reported should be reliable. However, what if someone had set the scale being used to weigh the wrestlers to start at 10 pounds rather than zero? Each person who stepped on the scale would think that they were 10 pounds heavier than they actually were, resulting in biased measurements. If that were the case, using the scale as the instrument and pounds as the units would be valid (makes sense as a way to measure weight); reliable (if the same person steps on the scale again and again, they will have nearly the same result); and biased (each measurement is 10 pounds too heavy). So, even though something is wrong with this measurement, it doesn't mean that everything is wrong with it. We want valid, reliable, and unbiased measurements. www.ckl2.org 104 Example 2 Suppose that a teacher intends to base grades in a math class on the students' heights. She plans to use a tape measure as her instrument and inches as her units. Her grading system will be as follows: the shortest student will receive the lowest grades and the tallest will receive the highest grades. Comment on the validity, reliability, and potential bias of this. Solution Validity? This clearly is not a valid way to measure a student's success and assign grades, because height has absolutely nothing to do with someone's understanding of math, or grade in a math course. Reliability? The tape measure should be reliable. If used properly, each time a particular student's height is measured we will expect to get the same answer. Bias? This should not be biased. Some tall people will deserve higher grades, while some will deserve lower grades. The same will be true for students of all heights. Therefore, this teacher's method for assigning grades would be unbiased and it would be reliable (both good things), but it would also not be valid (a bad thing). She should come up with a better way to measure students' grades. Perhaps a combination of test scores and homework completion. So, keep in mind that just because a statistical measurement is bad, does not mean that everything will be wrong with it. It is important to think through each question separately: Is it valid?; Is it reliable?; Is it unbiased? 105 www.ckl2.org Rates versus Counts Something to watch out for is whether numbers should be changed to rates or percentages in order to make appropriate comparisons. For example, it would not make any sense to compare 'the number of people living in poverty' for each of the fifty states in the United States because of the variety in population sizes. Think of the number of people who live in the state of Rhode Island versus the number who live in California. It would be much more appropriate to compare 'the percentage of people living in poverty' for each state instead. Example 3 Luigi got a pair of jeans that are normally $64.95, for $52.50. Javier paid $48.75 for a pair of jeans that normally cost $58.25. Which jeans had a higher rate of discount? Solution Luigi's jeans were marked down $12.45 (64.95 - 52.50). Divide the amount of discount by the original cost (12.45/64.95) and get 0.1917. So, Luigi's jeans were marked down 19.17%. Javier's jeans were marked down $9.50 (58.25 - 48.75). Divide the amount of discount by the original cost (9.50/58.25) and get 0.1631. So, Javier's jeans were marked down 16.31%. Therefore, Luigi's jeans had a higher rate of discount. Methods for Collecting Data Once a question of interest is posed, there are different ways of collecting data. This is a quick overview of the methods for collecting data that will be studied in this chapter: sample surveys, census, observational studies, and experiments. Each will be covered in more detail in the following sections. As of now, we just want to be able to recognize which method was used or described. Sample surveys are often used as a way to collect data from just some of the people or objects being studied. Some examples of sample surveys are: mailed out questionnaires, online surveys, phone interviews, or quality control checks. Another way to collect data is through a census. This means that every single person or item in the population is checked, tested, or asked. When trying to determine whether something was a sample or a census, ask yourself if the researchers asked everyone (or tested everything). If yes, then it was a census. Sometimes it will be most appropriate to conduct an experiment - when the researchers actually 'do something' to the subjects. Observational studies are another common way to collect data. In observa- tional studies, the researchers do not 'do anything' to the subjects, they simply collect data that has already happened or happens naturally. All of these methods of data collection can yield interesting results and often answer questions. However, the only method that can actually prove that one variable causes another is an experiment. When trying to determine whether a research method was an experiment, ask yourself if the researchers did anything to the people or objects that were being studied? If yes, then it was an experiment. www.ckl2.org 106 Example 4 For each of the following scenarios, determine whether the situation described is an experiment, observa- tional study, census, or a sample survey. a) Researchers suspected that aspirin could help reduce the risk of having a heart attack. Seven hundred people, aged 40 or older, were willing to participate in a study. Half of these participants were randomly selected to take an aspirin each day. The remaining participants were given a pill that looked like the aspirin, but contained no actual medicine. The study went on for five years and the participant's health was monitored. b) In an effort to study how the high schools in Minnesota have been preparing students for college, an extensive questionnaire was developed. Ten percent of the high school juniors at every high school in the state were selected randomly to complete this questionnaire. c) Researchers suspected that tanning beds caused skin cancer. Each time a person was diag- nosed with skin cancer, they were asked a serious of questions including whether or not they had used a tanning bed. If they had, further questions were asked regarding how often, what type, and at what age, etc. d) In an effort to determine how many fish were in Lake George, the lake was drained and the fish were counted. Solution a) This is an experiment because the researchers changed something. They had the people take aspirin (or fake aspirin). b) This is a sample survey because only a part of all of the high school students were questioned. c) This is an observational study because no change was made. The researchers simply asked about past behavior. d) This is a census because every fish was counted. However, this is ridiculous!! So, let's hope they can find a better way to determine how many fish are in a lake next time! 107 www.ckl2.org Problem Set 4.1 Section 4.1 Exercises 1) Lucas is writing an article about the baseball teams for the school paper. He collects data about each player's position, batting average, number of at-bats, hits, stolen bases and whether each player is on the junior varsity or varsity team. Who are the individuals? Which variables are categorical? Which are numerical? 2) Malia has been put in charge of analyzing the employees at her company. She collects information regarding annual salary, years with the company, highest degree earned, job title, yearly contribution toward 401K, number of children, home address and phone number. Who are the individuals? Which variables are categorical? Which are numerical? 3) Determine whether each of the following variables is categorical or numerical. a) The heights of all of the volleyball players. b) The position played by all of the football players. c) The brand of mascara preferred by those surveyed. d) The numbers of texts sent per month. e) Each person's social security number. f) Each person's cell phone provider. 4) The fourth graders at Sand Creek Elementary are doing a unit on weather. There is a thermometer on the building just outside the classroom window. The students will record and analyze the temperature at 8:00 a.m. and 2:00 p.m. every school day for 5 weeks, and then create a graph and write a report based on their findings. a) Identify the variable of interest, the instrument used, and the units. b) Comment on the validity, reliability and potential bias for this study. 5) The first graders at Sand Creek Elementary are doing a unit on measurement. Each student has traced her or his own foot and cut it out. Each student will use his or her 'foot' to measure various objects around the room and school. Some of the measurements they will make are height of self and at least two other friends, width of the classroom door, length of a lunch table, etc. a) The variables of interest are the lengths, widths and heights of various objects. Identify the instrument used, and the units. b) Comment on the validity, reliability, and potential bias for this study. www.ckl2.org 108 6) Determine whether each of the following measurements would have a problem with any of the following: VALIDITY (problem would be a lack of), RELIABILITY (problem would be a lack of), BIAS, (problem would be the presence of). A measurement may have any combination of the factors. For each one with a problem, suggest a better way to make the measurement, (hint: answer similar to example #2) a) A speedometer is totally unpredictable. b) Cholesterol levels are determined by patients filling out a survey regarding their diet. c) Time is measured by using the clock on a cell phone. d) Grades in a Physics class are determined by students assessing themselves on a scale of 1 to 10. e) Grades in a statistics class are determined by students' scores on one cumulative test. f) Sobriety is determined by a breathalyzer that is calibrated to be too sensitive. 7) Super Duper High School has a total of 143 teachers. Suppose that you are a researcher who is interested in studying Teacher Effectiveness at SDHS . You intend to evaluate the effectiveness of all of the teachers for your report. a) What type of data collection method is this? b) Suggest at least two valid variables that you might study. Include an instrument that can be used to measure your variables and the units. c) Suggest at least two invalid variables that you might study. Include an instrument that can be used to measure your variables and the units. 109 www.ckl2.org 8) For each of the following scenarios, determine whether the situation described is an experiment, obser- vational study, census, or a sample survey. a) The Super Spaz Energy drink company randomly selects 2% of the cans filled each day, and tests them for volume, ingredient content, and taste. b) A government lobbyist analyzes the crime reports for the 4 counties in her community. c) New advertisements are generally tried out on focus groups before investing a lot of money to pay for airtime on national TV. d) Each student in Probability and Statistics will take the District Common Assessment as a final exam. e) A teenager decides to evaluate how serious her parents are about her curfew by coming home 15 minutes late just to see what happens. 9) Pasquale's Big and Tall Shop sold 127 suits during the first quarter of this year, and 17 were returned. Marco's XXL Shop sold 268 suits during the same time period, and 27 were returned. a) What were the number of returns for each shop? Which shop had a higher number of returns? b) What were the rates of returns for each shop? Which shop had a higher rate of returns? c) Which of these statistics gives a more clear representation of customer satisfaction? Explain. Review Exercises Rate of Change = Amount of Change Original Amount ♦> The original amount is not always the largest or the smallest amount. ♦!♦ Multiply by 100, to change to a percent. ♦> Determine whether this change is an increase or a decrease. 10) Jolene makes $12.45 per hour at her job. Last year she made $10.85. What percent of a raise did Jolene receive? 11) Michaela's favorite shoes are normally $42.99. Today she found a sale in which they were marked down to $27.99. What percent of a discount is this? 12) The number of incidents of hazing reported at Some Random High School was 84 during the 2010-2011 school year. The following year there were 37 incidents of hazing reported at SRHS. What is the rate of change in reported hazing incidents between these two school years? Is it an increase or a decrease? www.ckl2.org 110 13) SRHS has had a huge problem getting students to class on time, so the administrators have implemented a new tardy policy. In an effort to determine whether or not it is working to deter students from being tardy to class, data has been collected and analyzed. The following table shows some of the data: School Year 2010-2011 2011-2012 % of change (+ or-) Total number of tardies (to any class period) 5186 4295 Number of students with more than 10 tardies 175 59 Number of students with more than 20 tardies 112 77 a) Calculate the percent of change for each category and complete the table (round to the nearest tenth of a percent). b) Which category saw the most significant change? c) Based on these calculations, do you feel that the tardy policy is working? Explain your reasoning. Ill www.ckl2.org 4.2 Sample Survey and Census Learning Objectives • Differentiate between population and sample • Understand the terminology of sampling methods • Identify various sampling methods • Recognize and name sources of bias or errors in sampling Population vs. Sample What is the approval rate of the President? If we really wanted to know the true approval rating of the president, we would have to ask every single adult in the United States her or his opinion. If a researcher wants to know the exact answer in regard to some question about a population, the only way to do this is to conduct a census. In a census, every unit in the population being studied is measured or surveyed. In this example our population, the entire group of individuals that we are interested in, is every adult in the United States of America. A census like this (asking the opinion of every single adult in the United States) would be impractical, if not impossible. First, it would be extremely expensive for the polling organization. They would need a large workforce to try and collect the opinions of every single adult in the United States. Once the data is collected, it would take many workers many hours to organize, interpret, and display this information. Other practical problems that might arise are: some adults may be difficult to locate, some may refuse to answer the questions or not answer truthfully, some people may turn 18 before the results are published, others may pass away before the results are published, or an event may happen that changes peoples' opinions drastically, etc. Even if this all could be done within several months, it is highly likely that peoples' opinions will have changed. So by the time the results are published, they are already obsolete. Another reason why a census is not always practical is because a census has the potential to be destructive to the population being studied. For example, it would not be a good idea for a biologist to find the number of fish in a lake by draining the lake and counting them all. Also, many manufacturing companies www.ckl2.org 112 test their products for quality control. A padlock manufacturer, for example, might use a machine to see how much force it can apply to the lock before it breaks. If they did this with every lock, they would have none to sell. In both of these examples it would make much more sense to simply test or check a sample of the fish or locks. The researchers hope that the sample that they select represents the entire population of fish or locks. 113 www.ckl2.org This is why sampling is often used. Sampling refers to asking, testing, or checking a smaller sub-group of the population. A sample is a representative subset of the population, whereas the population is every single member of the group of interest. The purpose of a sample is to be able to generalize the findings to the entire population of interest. Rather than do an entire census, samples are generally more practical. Samples can be more convenient, efficient and cost less in money, labor and time. A number that describes a sample is a statistic, while a number that describes an entire population is a parameter. Researchers are trying to approximate parameters based on statistics that they calculate from the data that they have collected from samples. However, results from samples cannot always be trusted. Example 1 *** .MM**' in hi A poll was done to determine how much time the students at SDHS spend getting ready for school each morning. One question asked, "Do you spend more or less than 20 minutes styling your hair for school each morning?" Of the 263 students surveyed, 61 said that they spend more than 20 minutes styling their hair before school. Identify the population, the parameter, the sample, and the statistic for this specific question. Solution a) population (of interest): all students at SDHS b) parameter (of interest): what proportion of students spend more than 20 minutes styling their hair for school each morning c) sample: the 263 SDHS students who were surveyed d) statistic: p = 61/263 = 0.2319 = 23.19% Randomization One common problem in sampling is that the sample chosen may not represent the entire population. In such cases, the statistics found from these samples will not accurately approximate the parameters that the researchers are seeking. Samples that do not represent the population are biased. If someone was interested in the average height of all male students at his or her high school, but somehow the sample of students measured included the majority of the varsity basketball team, the results would certainly be biased. In other words, the statistics that were calculated would most certainly overestimate the average height of male students at the school. Samples should be selected randomly in order to limit bias. Also, if only three students' heights are measured, it is very possible that the average height of these three will not be close to the average height of all of the male students. The average of the heights of 40 randomly www.ckl2.org 114 chosen male students would be more likely to result in a number that will match the average of the entire population than that of just three students. Larger sample sizes will have less variability, so small sample sizes should be avoided. A random sample is one in which every member of the population has an equal chance of being selected. There are many ways to make such random selections. The way many raffles are done is that every ticket is put into a hat (or box), then they are shaken or stirred up, and finally someone reaches into the hat without looking and selects the winning ticket (s). Flipping a coin to decide which group someone belongs in is another way to choose randomly. Computers and calculators can be used to make random selections as well. The purpose of choosing randomly is to avoid any personal bias from influencing the selection process. Randomization will limit bias by mixing up any other factors that might be present. Think of the heights of those male students, if we assigned every male at that school a number and then had a computer program select 40 numbers at random, it is most likely that we would end up with a mixture of students of various heights (rather than a bunch of basketball players). Also, no one did something like just measuring their friends heights, or the first 40 males he sees who are staying after school, or everyone in first lunch who is willing to come be measured. A computer program has no personal stake in the outcome and is not limited by its comfort level or laziness. If the goal of our sample is to truly estimate the population parameter, then some planning should be done as to how the sample will be selected. First of all, the list of the population should actually include every member of the population. This list of the population is called the sampling frame. For example, if the population is supposed to be all adults in a given city and someone is working from the phone book to make selections, then everyone who is unlisted and those who do not have a land line telephone will not have any chance of being selected. Therefore, this is not an accurate sampling frame. Good Sampling Methods Simple Random Sample When the selection of which individuals to sample is made randomly from one big list, it is called a simple random sample (or SRS). An example of this would be if a teacher put every single student's name in a hat and then draws 5 names from the hat, without looking, to receive a piece of candy. In an SRS every single member of the population has an equal probability of being selected - every student has an equal chance of getting the candy. And, in an SRS every combination of individuals also has an equal chance of being selected - any group of 5 students might end up getting candy. It might be all 5 girls, it might be the 5 students who sit in the back row, or it might even end up being the 5 students who misbehave the most. Anything is possible with an SRS! Stratified Random Sample A simple random sample is not always the best choice though. Suppose you were interested in students' opinions regarding the homecoming theme, and you wanted to make certain that you heard from students from all four grades. In such a case it would make more sense to have four separate lists (freshmen, sophomores, juniors and seniors), and then to randomly select 50 students from each list to give your survey to. A selection done in this way is called a stratified random sample. A stratified random sample is when the population is divided into deliberate groups called strata first, then individual SRS's are selected from each of the strata. This is a great method when the researchers want to be sure to include data from specific groups. Divisions may be done by gender, age groups, races, geographic location, income levels, etc. With stratified random samples, every member of the population has an equal chance of being selected, but not every combination of individuals is possible. 115 www.ckl2.org Systematic Random Sample Another way to choose a sample is systematically. A systematic random sample makes the first selection randomly and then uses some type of 'system' to make the remaining selections. A system could be: every 15th customer will be given a survey, or every 30 minutes a quality control test will be run. A systematic random sample might start with a single list like an SRS, randomly choose one person from the list, then every 25th person after that first person will also be selected. Systematic random samples still give every member of the population an equal chance of being chosen, but do not allow for all combinations of individuals. Some groups are impossible, such as a group including several people who are in order on the list. www.ckl2.org 116 Multi-Stage Random Sample When seeking the opinions of a large population, such as all registered voters in the United States, a multi- stage random sample is often employed. A multi-stage random sample involves more than one stage of random selection and does not choose individuals until the last step. A pollster might start by randomly choosing 10 states from a list of the 50 states in the U.S.A. Then she might randomly choose 10 counties in each of those states. And, finally she can randomly choose 50 registered voters from each of those counties to interview over the telephone. When she is done, she will have 10x10x50 = 5000 individuals in her sample. This is another sampling method that gives individuals an equal chance of being chosen, but does not allow for all possible combinations of individuals. For example, there is no possible way that all 5000 of these voters will be from Texas. Random Cluster Sample Sometimes cluster samples are used to collect data. Splitting the population into representative clusters, and then randomly selecting some of those clusters, can be more practical than making only individual selections. In cluster sampling, a census is done on each cluster or group selected. When appropriately used, cluster sampling can be very useful and efficient. One needs to be careful that the clusters are in fact selected randomly and that this method is the best choice. When a study of teenagers across the country is to be done, a random cluster method can be the best choice. An SRS of all teens would be nearly impossible. Imagine that one big list of all teens! A multi-stage random sample might be theoretically ideal, but the practicality of surveying one teenager from a high school in Little Rock, and one from another high school in Duluth, and so on would be quite a nightmare. The best choice might be to randomly select 10 metropolitan areas, 10 suburban areas, and 10 urban areas from across the country. And then to randomly select one high school in each of these areas and then finally to randomly select 4 second hour classes from each of those high schools. Then survey the entire classes selected (clusters). This would be a combination of mulit-stage random selection and cluster sampling. Another use for random cluster sampling is quality control at a popcorn factory. If every hour, a bucket of popcorn is scooped out. The entire bucket of popcorn can be checked for salt content, appearance, number of kernels not popped, not burnt, etc. This is an example of a systematic random cluster sample, the system being 'every hour' a sample is taken, and the clusters being each bucket of popcorn. 117 www.ckl2.org Bad Sampling Methods Voluntary Response Sample Beware of call in surveys, and online surveys! Suppose that a radio hosts on KDWB says something like, u Do you think texting while driving should be illegal? Call in and have your opinion heard!" It is highly likely that many people will call in and vote "No!" However, the people who do take the time to call will not represent the entire population of the twin cities and so the results cannot possibly be trusted to be equal to what all members of the population think. The 'statistic' that this 'survey' calculates will be biased. The only people who will take the time to call in are those who feel strongly that texting while driving should be legal (or illegal). Such a sampling method is called a voluntary response sample. In voluntary response samples, participants get to choose whether or not to participate in the survey. Online, text-in, call-in, mail-in, and surveys that are handed out to people with an announcement of where to turn them in when completed, are all examples of voluntary response surveys. Voluntary response samples are almost always biased because they result in no response whatsoever from most people who are invited to complete the survey. So, most opinions are never even heard, except for those who have really strong opinions for or against the topic in question. Also, those who have strong opinions can call or text multiple times. A new problem that comes with the Internet is that many companies are offering to pay people to complete surveys, which makes any results suspect. For these reasons, the results of voluntary response samples are always suspect because of the potential for bias. Convenience Sample Another commonly used, but dangerous method for choosing a sample is to use a convenience sample. A convenience sample just asks those individuals who are easy to ask or are conveniently located - right by the pollster for example. The big problem here is that the sample is unlikely to represent the entire population. The fact that this group was convenient, implies that they most likely have at least something in common. This will almost always result in biased results. An interviewer at the mall only asks people who shop at the mall, and only at some given time of day, so many people in the community will never have the opportunity to be interviewed. When the population of interest is only mall- shoppers, this will be somewhat better than when the population of interest is community members. Even then, the interviewers choose whom to go up to and the interviewees can easily refuse to participate. With both of the bad sampling methods, the word random is nowhere to be found. That lack of randomness should serve as a big hint that some type of bias will likely be present. The scary thing is that most of the results we see published in the media are the results of convenience samples and voluntary response samples. One should always ask questions about where and how the data was collected before believing the reported statistics. www.ckl2.org 118 Example 2 Suppose that a survey is to be conducted at the new Twin's Stadium. A five question survey if developed. Population of interest: All of the 31,045 fans present that day. Sample size: 2,500 randomly selected fans. Identify specifically the sampling method that is being proposed in each scenario. Also, comment on any potential problem or bias that will likely occur. a) The first 2,500 fans to arrive are asked five questions. b) Fifty sections are randomly selected. Then ten rows are randomly selected from each of those sections. Then five seats are randomly selected from each of those rows. The people in these seats are interviewed in person during the game. c) A computer program selects 2,500 seat numbers randomly from a list of all seats occupied that day. The people in these seats are interviewed in person during the game. d) 2,500 seats are randomly selected. The surveys are taped to those 2,500 with instructions as to where to return the completed surveys. e) The number 8 was randomly selected earlier. The 8th person through any gate is asked five questions. Then, every 12th person after that is also asked the five questions. f) The seats are divided into 25 sections based on price and view. A computer program randomly selects 100 seats from each of these sections. The people in these seats are interviewed in person during the game. Solution a) This is a convenience sample. It will not represent everyone present that day. This will suffer from bias because all of these people have at least one thing in common- they arrived early. b) This is a multi-stage random sample. It will probably represent the entire popu- lation. As long as the people are in their seats and willing to answer the questions honestly, it could be a good plan. c) This is simple random sample. It will probably represent the entire population. As long as the people are in their seats and willing to answer the questions honestly, it could be a good plan. 119 www.ckl2.org d) This is a voluntary response sample. It is very likely that most of those surveys will end up on the ground or in the garbage. This will likely suffer from many people not responding. It is also probable that anyone who had an extremely negative experience will be more likely to complete their surveys. e) This is a systematic random sample. It will probably represent the entire popula- tion. As long as the people are willing to answer the questions honestly, it could be a good plan. f) This is a stratified random sample. It will probably represent the entire population. As long as the people are in their seats and willing to answer the questions honestly, it could be a good plan. Errors in Sampling Sampling Errors Some errors have to do with the way in which the sample was chosen. The most obvious is that many reports result from a bad sampling method. Convenience samples and voluntary response samples are used often and the results are displayed in the media constantly. Now, we have seen that both of these methods for choosing a sample are prone to bias. Another potential problem is when results are based on too small of a sample. If a statistic reports that 80% of doctors surveyed say something, but only five doctors were even surveyed this does not give us a good idea of what all doctors would say. Another common mistake in sampling is to leave an entire group (or groups) out of the sample. This is called undercoverage. Suppose a survey is to be conducted at your school to find out what types of music to play at the next school dance. The dance committee develops a quick questionnaire and distributes it to 12 randomly selected 5th period classes. However, what if they did this on a day when the football teams and cheerleaders had all left early to go to an out of town game. The results of the dance committee's survey will suffer from undercoverage, and will therefore not represent the entire population of your school. There is also the fact that each sample, randomly selected or not, will result in a different group of individuals. Thus, each sample will end up with different statistics. This expected variation is called random sampling error and is usually only a slight difference. However, every now and then the sample selected can be a 'fluke' and just simply not represent the entire population. A randomly selected sample might accidentally end up with way too many males for example. Or a survey to determine the average GPA of students at your school might accidentally include mostly honor's students. There is no way to avoid random sampling error. This is one reason that many important surveys are repeated with a new sample. The odds of getting such a 'fluke' group more than once are very low. www.ckl2.org 120 Non Sampling Errors One of the biggest problems in polling is that most people just don't want to bother taking the time to respond to a poll of any kind. They hang up on a telephone survey, put a mail-in survey in the recycling bin, or walk quickly past an interviewer on the street. Even when the researchers take the time to use an appropriate and well-planned sampling method, many of the surveys are not completed. This is called non-response and is a source of bias. We just don't know how much the beliefs and opinions of those who did complete the survey actually reflect those of the general population, and, therefore, almost all surveys could be prone to non- response bias. When determining how much merit to give to the results of a survey, it is important to look for the response rate ( %£%%£ ). The wording of the questions can also be a problem. The way a question is worded can influence the response of those people being asked. For example, asking a question with only two answer choices forces a person to choose one of them, even if neither choice describes his or her true belief. When you ask people to choose between two options, the order in which you list the choices may influence their response. Also, it is possible to ask questions in leading ways that influence the responses. A question can be asked in different ways which may appear to be asking the same thing, but actually lead individuals with the same basic opinions to respond differently. Consider the following two questions about gun control. "Do you believe that it is reasonable for the government to impose some limits on purchases of certain types of weapons in an effort to reduce gun violence in urban areas?" "Do you believe that it is reasonable for the government to infringe on an individual's consti- tutional right to bear arms?" The first question will result in a higher rate of agreement because of the wording 'some limits' as opposed to 'infringe'. Also, 'an effort to reduce gun violence' rather than 'infringe on an individual's constitutional right' will bring more agreement. Thus, even though the questions are intended to research the same topic, the second question will render a higher rate of people saying that they disagree. Any person who has strong beliefs either for or against government regulation of gun ownership will most likely answer both questions the same way. However, individuals with a more tempered, middle position on the issue might believe in an individual's right to own a gun under some circumstances, while still feeling that there is a need for regulation. These individuals would most likely answer these two questions differently. You can see how easy it would be to manipulate the wording of a question to obtain a certain response to a poll question. This type of bias may be done intentionally in an effort to sway the results. But it is not necessarily always a deliberate action. Sometimes a question is poorly worded, confusing, or just plain hard to understand, this will still lead to non-representative results, another thing to look at when critiquing the results of a survey is the specific wording of the questions. It is also important to know who paid for or who is reporting the results. Do the sponsors of this survey have an agenda they are trying to push through? A major problem with surveys is that you can never be sure that the person is actually responding truthfully. When an individual responds to a survey with an incorrect or untruthful answer, this is called response bias . This can occur when asking questions about extremely sensitive, controversial or personal issues. Some responses are actual lies, but it is also common for people just to not remember correctly. Also, sometimes someone who is completing a survey or answering interview questions will 'mess with the data' by lying or making up ridiculous answers. Response bias is also common when asking people to remember what they watched on TV last week, or how often they ate at a restaurant last month, or anything from the past. Someone may have the best intentions as they complete the questionnaire, but it is very easy to forget what you did last week, last 121 www.ckl2.org month, or even yesterday. Also, people are often hurrying through survey questions, which can lead to incorrect responses. So the results on questions regarding the past should be viewed with caution. It is difficult to know whether or not response bias is present. We can look at how questions were worded, how they were asked, and who asked them. Person-to-person interviews on controversial topics carry a definite potential for response bias for example. It is sometimes helpful to see the actual questionnaire that the subjects were asked to complete. There are sometimes mistakes in calculations or typos present in results, these are processing errors (or human errors). For example, it is not uncommon for someone to enter a number incorrectly when working with large amounts of data, or to misplace a decimal point. These types of mistakes happen frequently in life, and are not always caught by those responsible for editing. If a reported statistic just doesn't seem right, then it is a good idea to recheck calculations when possible. Also, if the numbers appear to be 'too good to be true', then they just might be! gmt j * - ~ Example 3 The department of health often studies the use of tobacco among teens. The following is a description by the Minnesota Department of Health describing how they chose the sample for the 2008 Minnesota Youth Tobacco and Asthma Survey. In 2008, they had 2,267 high school students complete surveys and 2,322 middle school students complete surveys. Each student in the sample completed an extensive questionnaire consisting of many questions related to tobacco use. Answer the questions that follow. To see the entire report go to: http://www.health.state.mn.us/divs/hpcd/tpc/reports/ Students were selected for the survey in two stages. First, 48 public middle schools (grades 6-8) and 51 public high schools (grades 9-12) were randomly selected, with probability of selection based on size of enrollment. Alternative schools and charter schools were included. The sample schools were randomly chosen by CDC using enrollment data provided by the Minnesota Department of Health. Next, three or four classrooms within each participating school were randomly selected, and all students in these classrooms were invited to participate. The number of schools and classrooms selected was reduced substantially in 2008 in order to reduce the burden on schools. The sample size is still adequate to provide reasonable statewide estimates. a) What type of sampling methods were used for this? b) Identify the population, the parameter of interest, the sample, and the individuals for this study c) When asking teens about tobacco use, what types or causes of bias will likely be present?. What could be done to limit bias? d) This graph shows how the percent of teens using tobacco has changed from 2000 to 2008. Identify the statistics that were found for this question in 2008. www.ckl2.org 122 Percent using any tobacco in last 30 days I a. 50 40 30 20 10 38.7 12.6 34.4 11.2 29.3 27.0 9.5 6.9 I — i 1 — i — I — i — I — i 1 — i — i — r 2000 2002 2005 2008 ■+— High school - -m- - Middle school Solution a) This study used a complicated combination of sampling methods. They used a stratified, multi-stage, random cluster sample method to select individuals. It was a stratified random sample (by high school and middle school), it was a multi- stage random sample (first random schools were selected, second random classes were selected) , and it was a cluster sample (every student in each class was given the survey) . b) population (of interest): all middle and high school students in Minnesota parameter (of interest): teen tobacco use sample: 2267 high school, and 2322 middle school students in Minnesota (from 48 public middle schools and 51 public high schools in Minnesota) individuals: each student who completed a survey c) Response bias: Tobacco use is not legal for people under 18, so teens will not want to tell the truth if they think they may get in trouble. Non-response bias: Some people were absent the day survey was given. Undercoverage: Only public school students were included, so those who attend private schools were left out. Wording of the questions: This could be a problem, but we do not know the exact wording so cannot be sure. 123 www.ckl2.org To avoid the response bias factor, surveys regarding controversial topics should all be anonymous. If you read further into this report, you will see that the students were assured all results would be anonymous (no names or ID numbers included). To avoid the non-response bias factor, students who were absent could be given the survey when they return to school. To avoid the undercover age of private school students, private schools could be in- cluded in the sample. d) In 2008, 27.0% of the high school students asked and 6.9% of the middle school students asked had used tobacco in the last 30 days. www.ckl2.org 124 Problem Set 4.2 Section 4.2 Exercises 1) For each of the following, determine whether the bold number is a parameter or a statistic, (hint: remember that a parameter is a number that represents an entire population and a statistic is a number that represents a sample) a) The average height of all oak trees is 42.3 feet. b) Ms. Anderson's class average on the final exam was 71.4%. c) The average number of songs that the students surveyed have on their iPods was 791 songs. d) Itunes reports that the average number of songs people have on their iPods is 503 songs. e) The sticker on the Super Speedster Sport Sedan says 17.82 mpg. f) Martin had to keep track of how much time he spent watching TV for a whole week. He found that last week he averaged 3.4 hours of TV per day. 2) Minnesota's Best High School found that last year they did not have enough seats or room for all of the family members who wished to attend the graduation ceremony. The administrators at MBHS need to decide where to hold the graduation ceremony this year, so they sent a questionnaire home with each of this year's 543 seniors early in September. They asked for the surveys to be completed and returned by September 27th. Of the 148 surveys returned, the average number of seats that will be needed is 6.2. To be safe, the administrators use 7 and determine that they will need a hall that can hold 3800 people (543 students X 7 seats = 3801 seats needed). Using this number they find an appropriately sized hall and reserve it. Identify each of the following as specifically as possible. a) population (of interest) b) parameter (of interest) c) sample d) statistic e) sampling method that was used f) What is the response rate (the percent of surveys returned)? g) What is wrong with the what these administrators have done? What type of bias or error is likely present? h) Will the statistic most likely by too high or too low? What is a likely consequence of this biased result? 125 www.ckl2.org 3. Suppose that a survey is to be conducted at Minnesota's Best High School. Population of interest: 2640 MBHS students. Sample size: 240 MBHS students. Identify specifically the sampling method that is being proposed in each scenario. Also, comment on any potential problem or bias that will likely occur. a) Every freshman's name is put on a slip of paper and put into a giant bucket. Sixty names are pulled out of the hat. This process is repeated for each grade level. b) A list of all students is obtained from the counselors. Julie randomly selects a number between 1 and 2640 and then finds the student that matches this number on the list. She then selects every eleventh person on the list after that one (cycling back to the beginning of the list) until 240 names are chosen. c) Surveys are handed out with lunches. Students are asked to complete them and turn them in on a table in the front of the cafeteria. d) A computer randomly selects 240 names from the entire list of students in the school database. e) Twelve teachers are randomly selected. Two of each of their classes are then randomly selected. Ten students from each of these classes are then selected. f) Three teachers, Mr. Niceguy, Mr. Greatguy and Mr. Happyguy, each volunteers to survey the students in all of his classes. www.ckl2.org 126 4. Name, and briefly describe, the type of bias that would most likely be present in each of the following situations: [ I'M fIlumg out ^**^ *A { FOR t&wtts SEt, TUE^ jfcSKtD HOW Mgcu (toHE^ I WOW <M* WW EKH ViBEt. SG I WROTC, "*500." EQU. m W*t. i PUT J 43: W WHEN "lUE^ hSKEO *UUfl WH I^MO^TE FLMQR ft, V WRQJTfc NAME SQWE AKWSiHG / MESSING - ftfc SXSJ, ^*-\ WITH OWft a) What is the name of the type of bias in the cartoon? b) As the 2010 Census was being conducted, many people did not return their forms. What type of bias is this? c) What type of bias would most likely be present if high school students are interviewed about their drinking and drug use habits? Would the statistics most likely over- or under-estimate the true parameters? d) What is the one type of sampling error that we expect to happen, but cannot do anything to avoid, called? e) When calculating the statistics from a survey, a typo is made. What type of error is this? f) A radio talk show host asks, "Do you think that the driving age should be changed to 18?" What type of bias will most likely be present? Why is this? g) If a survey is conducted by door-to-door interviews and the interviewers skip a few neigh- borhoods that 'make them nervous', what type of bias is this called? h) If an interviewer asks each person, "Do you prefer Pizza Ickarooni, or the delicious fresh flavors of Pizza Delicioso?", what type of bias is present? Review Exercises 5) One die is rolled, what is the chance that a number greater than four or an even number is showing? 6) One die is rolled, what is the chance that a number greater than four and an even number is showing? 7) Two dice are rolled. What is the probability that the sum of the number of dots showing is nine or greater? 8) If three dice are rolled, what is the probability of getting three of a kind (all 3 dice show the same number of dots)? 127 www.ckl2.org 4.3 Random Selection Learning Objectives • Obtain a random sample using a random digit table • Describe the process followed to obtain an SRS • Outline an appropriate sampling method Random Selection We have discussed that it is important to choose samples randomly in order to reduce bias, but we haven't discussed how to actually carry out the process. There are many ways to make random selections. A common way to choose things at random is to use a 'big hat', or box, or bowl, etc. For example, suppose that a teacher wanted to randomly select 5 students every day, from a class of 34 students, to hand in their homework to be graded. Each day she has all of the students' names in a big fish bowl. She will mix the names up and select 5 names. These students will turn their homework papers in right then, and the other students will not need to. The five selected names will be put back in the fishbowl, so they may be selected again tomorrow. This is an example of an SRS of size 5 of her class. Every student has an equal probability (5/34 or 14.7% chance) of being required to turn in his or her homework on any given day and any combination of five students may be chosen. One student may end up turning in her assignment several days in a row, while another student may never need to turn hers in all year long. The idea of a 'big hat' is a good method for random selection when working with small populations, but it is not always practical. Random selections can be made by flipping coins, rolling dice, or spinning a spinner. These days, most random selections can be done using technology such as a computer program or a random number generator on a graphing calculator, Another way that random selections are made in statistics is by using a random digit table. A random digit table is a long list of randomly generated digits from to 9. The digits are listed in groups of five simply to make it easier to read and not lose your place. Imagine that someone has a ten-sided die with each digit from to 9 marked on a side. They sit down, roll the die and write down the digit that appears, then they roll it again and write down the digit that appears, then they do this again and again. As you can imagine, this would take quite awhile, but would result in a long list of random digits. This is basically what a random digit table is. How to Use a Table of Random Digits There is a process to follow when using a random digit table to make your selection. You need to report your process with enough detail that if someone else were to follow your steps, they would end up with the exact same randomly selected numbers. The purpose of this is to prove, if needed, that your selection process was truly random so that no one can accuse you otherwise. The following example illustrates the steps you will need to follow (and explain) when using a random digit table to make your random selection. www.ckl2.org 128 Example 1 Five boxes, each containing 24 cartons of strawberries, are delivered in a shipment to a grocery store. The produce manager always selects a few cartons randomly to inspect. He knows better than to just look at some of the cartons on the top or only in one box, because sometimes the rotten ones are on the bottom. Today he wishes to select a total of 6 cartons to inspect. He has the boxes arranged in order and has a set way to count the cartons inside each box. Explain the process used to make the random selection using a random digit table. Solution Step 1: Assign numbers to the list (must all be an equal number of digits long) Since he has 120 cartons total, he will assign the numbers 001 to 120 to represent the cartons in order. Step 2: Choose a starting line on the random digit table. If the problem states a line to start at, use that line. Otherwise, pick any line you want and record the line number. If you run out of digits, simply move to the next line down. He will use line 119 to make the selections. Step 3: Decide how many digits to look at each time. The number of digits in your largest number is required. He will need to look at 3-digit numbers every time. Step 4: Decide if any numbers will need to be ignored and whether or not repeats will be allowed. He will not want to inspect the same carton twice, so he will ignore any repeats. And, any numbers above 120 will not apply in this case, so he will ignore numbers 121-999 and 000. Step 5: State when to stop. He will stop once six numbers are selected. He will then find the cartons that the numbers represent and inspect those cartons. Step 6: Report the numbers that were selected. When given a specific list, go back and determine which specific individuals have been selected. Here is a part of the random digit table so that you can see how the selection was made. Note that dividers have been placed between each group of 3- digits for this example. When we reach the end of a line, we simply continue on the following line. 129 www.ckl2.org Lineti Line 119 Line 120 Line 121 Line 122 Line 123 random (ligitsin groups of five: 958|57 0|711|8 87|664| 920|99 5|880|6 66|979| 986|24 8|482|6 35|476| 559|72 3|942|1 65|850| 042J66 3|543|5 43|742| 119137 7|148|7 09|984| 290|77 1|486|3 61|683| 470|52 6|222|4 5110251 138 73 8|159|8 95|052| 909| 0g Tf 359|2 75|186| 871|36 9|576|1 54|580| 815|07 2J7102 56027 55892 33063 41842 81868 selections: no values fit the range #042 and #119 #025 #052 and #087 #072 that is six^so we stop,, So, the strawberries in cartons numbered 042, 119, 025, 052, 087, and 072 will be inspected. The entire deliverywill be accepted or rejected based on this random sample of 6 cartons. Example 2 Five of the employees at the Stellar Boutique are going to be selected to go to a training in Las Vegas for four days. Everyone wants to go of course, so the owner has decided to make the selection randomly. She has decided to send two managers and three sales representatives. The employees' names are listed in the table below. a) What type of sampling method is this? b) Explain the process she can follow to use a random digit table, starting at line #108, to select the employees who will get to go to the training. Select the managers first, then select the sales representatives. Managers Angela Barbara Elise Gigi Malena Rosie Tammy Veronica Sales Representatives Sales Representatives Sales Representatives Alfie lima RavAnne Bertie Lou Jo Jo Sandy Cari Katarina Shirley Carrv Lin Suzi Darcy Marcie Tawanda FanFan Nancy Wendy Heidi Oprah Zulu Solution a) This is a stratified random sample. b) For the managers: www.ckl2.org 130 Assign numbers to the list 1 to 8 Use random digit table, starting at line #108 Look at one digit at a time Ignore 9, 0, and any repeats Stop when two have been selected State the names Managers 1- Angela 2- Barbara 3- Elise 4- Gigi 5-Malena 6-Rosie 7- Tamnrv 8- Veronica Line # Random digits in groups of five Selections Line 108 60 9 40 72024 17868 24943 61790 90656 87964 18883 #6 and #4 So, Rosie(#6) and Gigi(#4) will be the managers who get to go to Las Vegas. For the sales representatives: • Assign numbers to the list 01-21 • Use random digit table, starting on the next line, #109 • Look at two digits at a time • Ignore 22-99, 00, and any repeats • Stop when three have been selected • State the names Sales Representatives Sales Representatives Sales Representatives 01-Alfie 08- lima 15- Ray Anne 02- Bertie Lou 09- Jo Jo 16- Sandy 03- Can 10-Katarina 17- Shirlev 04- Carry 11- Lin 18-Suzi 05-Darcv 12- Marc ie 19-Tawanda 06- Fan Fan 13- Nancy 20- Wendy 07-Heidi 14- Oprah 21- Zulu Line # Random digits in groups of fixe Selections Line 109 36 00 9 193 65 15 412 3 9638 85 45 3 4 68 16 83 48 5 4 19 79 #15. #16. & #19 So, Ray Anne(#15), Sandy (#16), and Tawanda(#19) will be the sales representa- tives who get to go to Las Vegas. 131 www.ckl2.org Problem Set 4.3 Section 4.3 Exercises Use the table of random digits in the appendix for the following problems. 1. The manager at Big-N-Nummy-Burger wishes to know his employees' opinions regarding the work environment. He has 56 employees and plans to select 12 employees at random to complete a survey. a) Explain the process he can follow to use a random digit table, starting at line 108, to select an SRS of size 12. b) Which employees numbers were selected? 2. Use a random digit table to select an SRS of five of the fifty U.S. States. Explain your process thoroughly and report the five states that you chose. Repeat this a second time, but begin on a different line on the random digit table. Compare your lists to another classmate's lists. Did you end up with any of the same states in your samples? Table 4.1: Alabama California Florida Illinois Kentucky Massachusetts Missouri Alaska Colorado Georgia Indiana Louisiana Michigan Montana Arizona Connecticut Hawaii Iowa Maine Minnesota Nebraska Arkansas Delaware Idaho Kansas Maryland Mississippi Nevada New Hampshire North Carolina Oregon South Dakota Vermont Wisconsin New Jersey North Dakota Pennsylvania Tennessee Virginia Wyoming New Mexico Ohio Rhode Island Texas Washington New York Oklahoma South Carolina Utah West Virginia www.ckl2.org 132 3. Washington High School has had some recent problems with students using steroids. The district decides that it will randomly test student athletes for steroids and other drugs. The boy's hockey team is to be tested. There are 13 players on the varsity team and 21 players on the junior varsity team. Use a table of random digits starting at line 122, to choose a stratified random sample of 3 varsity players and 5 junior varsity players to be tested. Remember to clearly describe your process. Varsity Team Roster, by last name: Alexander Nix Baker Radamacher Brooks Ritchie Finch Smithe Gustaf Thomas Under West Mullen Junior Varsity Team Roster, by last name: Andersen Manzel Anderson Peterson Baker Randal A Christian Randal J Donnovan Reeder Greene Rice Hansen Sams James Sentel Klein Thorne Under West Lutz Review Exercises 4) Sketch a Venn Diagram that shows two events that are mutually exclusive. 5) Suppose that a survey was conducted at SRHS and it found that 86% of students have their own cell phones, and that 64% of students have their own IPod (or other similar personal music device). Furthermore, 9% of the students at SRHS say that they have neither one of these. a) Define your variables and construct a Venn Diagram that fits this scenario. b) What is the probability that a randomly selected student has both a cell phone and an IPod? c) What is the probability that a randomly selected student has either a cell phone or an IPod? 133 www.ckl2.org 4.4 Statistical Conclusions Learning Objectives • Understand when valid statistical conclusions can be made. • Calculate an estimated margin of error and 95% confidence interval • Make confidence statements Statistical Conclusions Remember that when you collect information from every unit in a population, it is called a census. In doing a census, we can be certain that the numbers we have calculated really do represent the entire population. But, because a census is often impractical, we generally take a representative sample of the population, and use that sample to try to make conclusions about the entire population. The downside to sampling is that we can never be completely, 100% sure that we have captured the truth about the entire population. For example, imagine taking a random sample of 100 from a large population. Put those back and choose another sample of 100, repeating many times. Each of these samples of size 100 will include a different combination of 100 members of the population. Thus, each sample will result in different statistics. This natural difference between various samples is an expected random sampling error. To take this into account, researchers generally report their findings to have a margin of error or to be within a certain range of possible values. This range is called a confidence interval. For example the President's approval rating might be reported as, "The approval rating for the President is 43.2%, with a margin of error of ±3%." Which could also be reported as, "The approval rating for the President is between 40.2% and 46.2%." Using a statistic to make a conclusion about a population is called statistical inference. This course is an introduction course, so we will only briefly touch on this idea. In a future statistics class, you will learn much more about statistical inference and calculations. It is important to note that statistical conclusions are meaningless when poor sampling techniques have been used. If the data was collected from a voluntary response sample, or you had a low response rate, or an incomplete sampling frame was used, then don't waste your time performing inference on your statistics. Random sampling error is the only type of error or bias that the margin of error accounts for. 95% Confidence Intervals Once a statistic is calculated for a sample, it is used as an estimate for what the actual parameter might be. We do not know whether our statistic is close to the population parameter, or if it is too high, or too low, so we build our interval around the statistic. We add the margin of error to, and subtract the margin of error from, our statistic. We then report this range of values as our confidence interval, the interval that we are fairly confident that the true parameter must be within. In a more formal course you will learn how to calculate the margin of error more precisely, and for various levels of confidence (such as 90% or 99% etc.). In this course we will use a simple formula that estimates the margin of error for a www.ckl2.org 134 95% confidence interval. We will also make a 95% confidence statement, which explains our conclusion regarding the population parameter in context. The formulas for an estimated 95% margin or error and confidence interval are: Margin of error formula: , 1 m.e.= +-p Confidence interval: p + ?n. e. or statistic ± margin of error n — sample size *note: In order to make a smaller margin of error, and therefore a more narrow confidence interval, one must increase the size of the sample. Once you have found the range of numbers for your confidence interval, you are going to state your conclusion in context. Such a statement is called a confidence statement. The confidence interval refers to the population - not the sample. We are 100% certain of our sample statistic. It is the population parameter that we are estimating. Writing a confidence statement can be kind of confusing, so you can just use the following template: "We are 95% confident that the true proportion of will be between (low value of CI) and (parameter of interest) _(high value of CI) **** Example 1 A random sample of 125 union members was conducted to see whether or not the union members would support a strike. Sixty-four of those surveyed said that they would support a strike unless safety conditions were improved. Identify each of the following as specifically as possible: a) Population of Interest: All members of this union b) Parameter of Interest: The percent of the union members who would support a strike 135 www.ckl2.ors c) Sample: The 125 union members who were surveyed d) Statistic: (p-hat) = ^ = 0.512 e) Margin of Error: m.e. ±-7= = ±0.0894 f) 95% Confidence Interval: 0.512 + 0.0894 = 0.6014 and 0.512 - 0.0894 = 0.4226 [0.4226 to 0.6014] or [42.26% to 60.14%] g) Confidence Statement: "We are 95% confident that the true proportion of union members who would support a strike is between 42-26% and 60.14%" (Note- Two additional Videos in the Example) < ## " <**" fl C3 www.cki2.0rg 136 Problem Set 4.4 Section 4.4 Exercises 1. A survey was done to determine the texting habits of MBHS students. An SRS of 270 students were asked several questions related to texting and cell phone usage. Of particular interest to the researchers was the proportion of students who text while in class. Of those surveyed, 178 said that they text during class at least ten times per week. Identify each of the following as specifically as possible. a) Population of Interest b) Parameter of Interest c) Sample d) Statistic e) Margin of Error f) 95% Confidence Interval g) Confidence Statement h) Do you personally feel that this is too high or too low of an estimate of the proportion of teens at your high school who text during class? 2. To predict the outcome of an upcoming Mayoral election, a random sample of 814 voters was selected. These people are asked several questions regarding the election. One question asked whether they were "...leaning Republican, Democratic, Independent, or other/undecided?" Based on this question, 38.2% of respondents said that they were "..leaning Democratic...". Identify each of the following as specifically as possible. a) Population of Interest b) Parameter of Interest c) Sample d) Statistic e) Margin of Error f) 95% Confidence Interval g) Confidence Statement 137 www.ckl2.org 3. In the same survey, 42.3% said that they were "...leaning Republican...". a) Calculate an estimated 95% confidence interval b) Is this enough evidence to "call" the election in favor of the republicans? Why or why not? 4. The quality control officer at Spaz Cola uses a systematic random sampling method to select cans of Spaz Cola to determine whether the machines are maintaining the correct recipe. Among the 480 cans analyzed today, 43 cans contained less sugar than the Spaz recipe requires! Identify each of the following as specifically as possible. a) Population of Interest b) Parameter of Interest c) Sample d) Statistic e) Margin of Error f) 95% Confidence Interval g) Confidence Statement h) Do you think that the company should be concerned? Why or why not? Review Exercises 5) Marcus got 18 points correct, out of 42 possible points, on his science test. On his history test, Marcus got 31 points out of 55 possible points. On which test did Marcus do better? Explain or show how you know. 6) Lydia got 15 points correct on her probability quiz (out of 23 possible). Then she earned 37 points, of the 48 possible points on her probability test. On which of these assessments did Lydia do better? Explain or show how you know. 7) The figure below is a dartboard. Suppose that a dart is thrown at it randomly. What is the probability that the dart will land on the shaded area? www.ckl2.org 138 E II JZ. 5 / r=5m \ €^ Length =28m 8) Sketch two different "dart boards" such that the probability of hitting the shaded are is equal to one- third. 139 www.ckl2.org 4.5 Experiments and Observational Studies Learning Objectives • Know the terminology of basic experimental design • Identify the elements of an experiment • Distinguish between observational studies and experiments • Outline experiments • Understand the effects of lurking variables Observational Studies and Experiments When researchers collect data about subjects without imposing any type of treatment, they are doing an observational study. Many conclusions have been based on observational studies. The discovery that smoking causes lung cancer was initially theorized based on observational studies. Many consumers of cigarettes and tobacco companies questioned the validity of such studies, suggesting that it could have been some other variables that caused the cancers, not the cigarettes. Retrospective studies, based on past history of lung cancer patients showed that a high proportion of them were smokers. This did not convince those who either enjoyed smoking, or were making money off of tobacco. There could be some lurking variables to blame, extra variables that were not taken into account, but were actually the cause. Prospective studies, following people in the future, were undertaken in an effort to see whether or not there was a link between cigarette smoking and lung cancer. The statistics were still called into question because statisticians know that the only way to truly show causation is through a controlled experiment. An experiment imposes some 'treatment' on the subjects. A controlled experiment involves having more than one group, where the only variable that is different between the groups is the treatment being tested. And, subjects will need to be assigned at random (left to impersonal chance) to the various treatment groups to control for lurking variables. With regard to cigarettes and lung cancer, researchers would need to find a group of non-smokers and randomly divide them into two groups. The randomization will divide up lurking variables that the researchers cannot control for. Also, there needs to be a fairly large number of subjects in each group so that the results do not appear to be some kind of a fluke. The researchers would then need to force one group to smoke cigarettes, while making sure that those in the control group did not smoke. This would go on for several years and both groups would need to be checked for lung cancer regularly. Clearly, there is no ethical way to do such an experiment. We cannot force people to do something that we suspect may cause cancer! Scientists were able to experiment on rats to see whether or not cigarettes caused cancer, and it did. Eventually, the compilation of all of these studies convinced everyone that smoking does cause cancer. The Three Elements of Good Experimental Design are: 1. Randomization— Subjects must be randomly assigned to treatment groups in an effort to divide up any lurking variables www.ckl2.org 140 2. Control-There should be a control group- a group that does not receive the treatment. Having more than one group, where the only difference is the treatment being tested, allows for comparisons to be made. 3. Replication-There should be a large enough number of subjects so that the results seem believable. Also, the experiment should be able to be replicated on a different group of subjects. Experimental Design In an experiment, the people, animals, or objects, that are being experimented on are called the subjects. The treatment that is being tested is the explanatory variable. The result, outcome, or change that happens (or doesn't happen) is the response variable. Keep in mind that sometimes it is necessary to give a pre-test prior to imposing the treatment. For example, if we are testing a medication that claims to lower cholesterol levels, we will certainly need to know the cholesterol levels of all of our subjects prior to giving them the treatment. At the end of the experiment we will again test them and then we can compare any change in cholesterol level. The control group may be given no treatment at all. Or, you may want to use the control group as a way to compare a new treatment to an old treatment. For example, if someone has developed a new medication that they believe will cure headaches, they will want to compare it to aspirin, acetaminophen, and ibuprofen. Such researchers will likely form four randomly assigned groups (Groups A, B, C, and D), assigning the subjects in each respective group to take a specific one of the treatments whenever they have a headache and to record whether or not it worked and how quickly. After some length of time, the researchers will collect the data from the four groups and compare the results. With the only difference being which treatment was taken, researchers can make conclusions determining which treatment (if any) worked better than the others. 141 www.ckl2.org ©QDGDDki® feCT ©DO lSS|p©[jfl[JijD@DQG©D ©©©BgjJOD describe the subjects as specifically as possible randomly assign (explanatoryvarible) describe the treatment forth is group randomly assign (explanatory variable-this is often the control group) describe the treatment forth is group specifically explain what will be compared (response variable) Mention other elements of the experimental design: Blocking? Blind? Double-Blind? Placebo controlled? Length of time of the treatment? There are some other potential problems here though. For instance, would you want the subjects to know which medication they are receiving? It is very possible that they may have some preconceived notions regarding the effectiveness of one or more of these medicines. Such unconscious bias can influence how they perceive the treatment to work. What researchers often do to avoid any bias that the subjects will bring with them is to not tell them what treatment they are receiving. Such an experiment is said to be blind. It is also possible that the researcher may have preconceived notions, or hopeful expectations, regarding the effectiveness of one or all of the treatments. To avoid this, a third party can package the various treatments in similar looking containers, each marked only with a code, before the researcher distributes them to the subjects. In this case neither the subjects nor the researcher distributing the treatments know who is getting what. This is a double blind experiment, and is used often in clinical trials to limit bias. Another issue is that often a patient's symptoms may improve just at the 'idea' of getting a medication. This is called the placebo effect. Imagine a child who is crying dramatically over a scraped knee, but stops immediately once mom puts a bandaid on. The bandaid is the placebo. It is also common for a participant, who believes that she or he is receiving a potentially promising medication, to have symptoms improve simply because of her or his expectation that they will. To account for this placebo effect, researchers will often give the control group a fake treatment called a placebo. A placebo is sometimes called a "sugar pill"- it looks like the real treatment, but has no active ingredients. Placebos make blind and double-blind experiments possible. An experiment could involve a placebo shot, or even a placebo surgery (aka sham surgery). We will demonstrate how to outline an experiment through the following examples. See the sample outline above as a reference. Example 1 Suppose that a group of scientists have developed a medication that they believe will cure mean-ness. They are calling it Kind At Last (KAL). There are 520 mean people who are willing to participate in this study (300 males and 220 females). This pill needs to be taken twice daily and it may take a few weeks to be fully absorbed into a person's system. Identify the following, and outline a completely randomized experiment. www.ckl2.org 142 a) Subjects b) Explanatory Variable c) Response Variable d) Will it be blind? Double-blind? placebo controlled? is a pre-test necessary? e) Outline a completely randomized experiment Solution a) Subjects: the 520 mean people (330 male & 220 female) b) Explanatory Variable: the KAL pills c) Response Variable: any change in mean-ness d) will it be blind? double-blind? placebo controlled? is a pre-test necessary? this could definitely be placebo controlled and double-blind. Neither the patients, nor the person distributing the medicine will know which people are receiving which medication. The KAL pills and the placebos will look identical and be in similar packages. e) Outline a completely randomized experiment: 260 randomly assigned mean people Take KAL pill twice daily 260 randomly assigned mean people Take Placebo pill twice daily Compare any change in "mean-ness", attitude, bullying, opinions of relatives, etc. I •Subjects will take the pi I Is for four months. '*This experiment will be double-blind. The previous example is the a completely randomized experiment because all of the subjects started in one group. All subjects were then randomly assigned to treatment groups, with any combination of genders being possible. What if it was theorized that this medication actually has different effects on males than on females? With a completely randomized design it is very possible that we would not end up with an equal number of males and females in each treatment group. If that were to happen, we would not be able to tell whether the treatment affected different genders in the same way or not. 143 www.ckl2.org Randomized Block Designs In such a case, it is a good idea to involve blocking in your experimental design. When it is suspected that different subgroups may respond differently to the treatment, the statisticians separate them at the beginning into intentional subgroups called blocks. The subjects in an experiment may be blocked by age, gender, race, previous medical history, etc. Be sure that you do not say that you will randomly assign to the blocks. You cannot randomly choose who is male or female, and you cannot randomly choose who is which race, etc. Each block is then randomly divided among the various treatment groups. This assures a more equal distribution of the subjects among the treatments. It also directly addresses the effects of this suspected lurking variable. Experimental designs in which blocking is used are called randomized block designs. Example 2 Outline a randomized block design to test the KAL pills that blocks by gender. (Continued from example 1) Solution Outline a randomized block design— 150 randomly assigned mean males Take KAL pill twice daily 150 randomly assigned mean males Take Placebo pill twice daily Compare any change in "mean-ness", attitude, bullying, opinions of relatives, 220 Mean females 110 randomly assigned mean females Take KAL pill twice daily 110 randomly assigned mean females Take Placebo pill twice daily Compare any change in "mean-ness", attitude, bullying, opinions of relatives, * Blocked by gender. *Subjects will take the pills forfour months. Thisexperiinentwil! be doubie-blind. 5 www.ckl2.org 144 Once you have done the comparisons within blocks, you will also want to compare across blocks to see if there are differences. For example, perhaps this KAL medicine works really well on males, but doesn't do a thing for females. 145 www.ckl2.org Section 4.5 Exercises 1. Researchers want to determine how effective a new allergy drug called Scratch-Be-Gone is at reducing pet allergies. One pill should be taken daily with a meal. 450 pets suffering from allergies will participate in a clinical study comparing this new drug with an existing market drug and a placebo. Identify each of the following: a) Subjects b) Explanatory Variable c) Response Variable d) Is it possible for this experiment to be double-blind? Explain. e) Outline a completely randomized experiment 2. Ms. Rokinroll has a theory that listening to music while working on probability problems will help students retain knowledge. She has a set of earphones for each student and intends to compare the effects of classical music, country music, and heavy metal music. Her first period probability class has 36 students and her last period probability class has 34 students. Identify each of the following: a) Subjects b) Explanatory Variable c) Response Variable d) Do you feel that a control group of no music is necessary? Why or why not? e) Do you feel that any of the following should be a part of this experiment: blind, double-blind, pre-test? placebo controlled? f) Outline a randomized block design experiment www.ckl2.org 146 3. Researchers want to test a new eye drop against Blink Brand Eye Drops to see if it is better at reducing dry eye symptoms for contact wearers. The researchers are also interested in whether males and females will respond differently. The subjects available are 480 male and 502 female contact wearers who suffer from frequent dry eyes. Identify the following: a) Subjects b) Explanatory Variable c) Response Variable d) Outline an appropriate experiment: (will it be blind? double-blind? blocked? placebo controlled?) e) Clearly explain how a table of random digits can be used to do the randomization. Using line #129 select the first five males who will be in the first treatment group. 4. A new type of cell phone is being developed by The Millionaire Phone Makers Corporation. This phone, called Make-Us-More-Money (MUMM), has a target audience of college students and young professionals (18-35 year olds). The company has developed three different ad campaigns - a commercial for each has been made and will be tried on the test subjects. The company wants to determine which ad campaign will be most effective prior to flooding the market, so they will test the various commercials on 744 University of Wisconsin students, and 3,057 people who attend this year's Young Professionals Conference in Los Angeles. After viewing a commercial, each subject will fill out a questionnaire that test how likely they would be to purchase the MUMM phone. Identify each of the following: a) Subjects b) Explanatory Variable c) Response Variable d) Outline an appropriate experiment (it will need to be blocked by the two different locations) e) This scenario is different than the previous examples, because in the other examples we were able to do the randomization in advance. That would not be possible for something like this in both locations, the subjects will walk up to the researcher and need to be assigned to a 'treatment group'. Explain how the randomization can be done in this case. 147 www.ckl2.org 4.6 Chapter 4 Review This chapter covers the topics of data collection methods and potential sources of bias. We learned about experiments, observational studies, sample surveys and censuses. Several potential errors and sources or bias were introduced. We also learned how to use a random digit table to make random selections, how to outline experimental designs, and how to calculate and state 95% confidence intervals. You should go back and read through each of the sections in this chapter, paying careful attention to all of the new terms in bold. This will help you to do problem 1 from your homework assignment. Review Exercises 1) Study your new vocabulary! a) Make flashcards with all of the terms from this chapter. Write the term on one side of the card. On the other side, write a brief definition and include an example. b) Study your flashcards. 2) Each statement below claims that the ACT's are not a fair measurement for college readiness, but for a different reason. For each student's statement, determine whether he or she is questioning the validity, the reliability, or claiming that it will be biased. Explain your answers. a) The ACT's are not fair because it is timed and I cannot work fast enough. Consequently I am not really doing as well as I could; I always get a lower score than I should receive. b) The ACT's are not fair because the vocabulary is not clear and I do not even understand what the questions are asking me. I always study really hard and do my homework and I am totally ready for college, but that doesn't show up on some stupid test. c) The ACT's are not fair because the first time I took them I scored a 21, but the second time I scored a 16. How can that be right? 3) Suppose you want to take a simple random sample of 350 women, from a population of 4700 females on the University of Coolness campus. a) Explain the steps you would follow if you were going to make the selection using a table of random digits (be thorough). b) Starting at line 137, select the first five numbers. www.ckl2.org 148 4) Suppose that after carrying out your survey of 350 women, you found that 74 of the women said that they "did not feel safe walking on campus after dark". Identify each of the following. a) Population of interest b) Parameter of interest c) Sample d) Statistic e) Margin of error (quick method for 95%) f) Calculate a 95% confidence interval g) Make a 95% confidence statement 5) A high school social studies teacher wants to see if giving a completely multiple choice test versus a traditional free response test will improve student scores. She has designed two versions of the chapter 6 test to test her question. She has two classes - a 1 st hour class with 32 students, and a 5 th hour class with 37 students. The two classes are very different in both behavior and academic performance, so she decides to carry out her experiment using blocking. Identify the following: a) Subjects b) Explanatory variable(s) c) Response variable d) Will this experiment be blind? double-blind? placebo controlled? e) Outline a randomized block experiment. 149 www.ckl2.org 6) Suppose that you are trying to determine whether kindergarten students who have gone to child care centers show more aggressive behaviors than children who have not attended child care centers. You have the data regarding whether or not each child attended a child care center and for what length of time. You are now going to study aggressive behaviors. For each of the following, decide which type of data collection method is being proposed: observational study, sample survey, census, or experiment. a) Observers will watch the kindergartners on the playground, recording aggressive behaviors. b) A survey will be given to 20 randomly selected parents asking each to rate his or her child's behavior. c) A survey will be given to all kindergarten parents asking each to rate his or her child's behavior. d) During center time, a teacher will take a toy away from a child and record whether they act aggressively. 7) For the study in question 6 a) Suggest something that may go wrong with, or may be a source of bias, for each of the proposed data collection methods. b) Which of the methods do you feel will yield the best results? Explain. 8) A researcher at the University of Minnesota believes that a certain component of ant venom can be used to lessen the amount of swelling in the knuckles of people suffering from arthritis. The ant venom treatment has been made into a capsule form that can be swallowed, and is designed to be taken one time per day. Suppose that you have 200 people suffering from arthritis who have volunteered to participate in this study. Identify the following: a) Subjects b) Explanatory variable c) Response variable d) Will this experiment be blind? double-blind? placebo controlled? e) Outline a completely randomized design www.ckl2.org 150 9) Your teacher wants to find out whether chocolate helps students concentrate on their tests. In one class, she gives all of the students chocolate before the test begins. In another class, she does nothing. Is this an example of an observational study, a sample survey, a census, or an experiment? Give reasons to support your answer. 10) You bought a sweater on discount that was originally marked at $30. When you got to the register, it rung up as $23. What was the percent discount? 11) The cost of gas in 2001 was $1.45 per gallon. The average cost today is $3.75. a) What is the amount of increase? b) What is the percent of increase? 12) The table below shows the number of seniors and the number of seniors graduating for three high schools. School Number of Seniors Number Graduating Graduation Rate % McArthur 423 354 Meade 125 110 Eisenhower 392 379 a) Which school has the most students graduating? b) Determine the graduation rate for each school. c) Which school has the highest graduation rate? 151 www.ckl2.org 13) Identify the sampling method used in each of the following: SRS, stratified random sample, sys- tematic random sample, multi-stage random sample, random cluster sample, voluntary response sample, convenience sample a) Every fifth person boarding a plane is searched thoroughly. b) At a local community College, five math classes are randomly selected out of 20 and all of the students from each class are interviewed. c) A researcher randomly selects and interviews forty male and forty female teachers, from a university with 122 female and 135 male instructors. d) A researcher for an airline interviews all of the passengers on five randomly selected flights. e) Based on 12,500 responses from 42,000 surveys sent to its alumni, a major university esti- mated that the annual salary of its alumni was 92,500. f) A community college student interviews everyone in his biology class to determine the per- centage of students that own a car. g) A market researcher randomly selects 200 drivers under 35 years of age and 100 drivers over 35 years of age, from those insured with Quality Car Insurance. h) A researcher selects 12 states randomly. From each state, she randomly selects 20 middle schools. From each middle schools, she randomly selects 15 teachers. The 3,600 teachers were then interviewed by phone.. i) To avoid working late, the quality control manager inspects the last 10 items produced that day. j) The names of 70 contestants are written on 70 cards. The cards are placed in a bag, and three names are picked from the bag. www.ckl2.org 152 14) The athletic director wants to know how tax payers in the community feel about funding for athletics at the high school. He surveys his coaches and the parents of athletes at his school. Describe what is wrong with his methodology. 15) Suppose that a poll was commissioned to determine whether people in the U.S. believe that pro wrestling is a sport. Identify the type(s) of bias that will likely be present in each of the following scenarios. Some will have more than one. Explain your answers. a) An online poll was sent to all visitors of the WWE website. b) Telephone interviews are done to randomly selected phone numbers between 4 p.m. and 8 p.m. c) Some people were embarrassed to admit that they liked wrestling and think it is a sport. d) One of the questions asked was, "Do you believe that the pro wrestlers are actors or should they be considered serious athletes?" e) One of the researchers spilled coffee on a big stack of surveys and several had to be thrown away. f) A second poll done had slightly different results. g) WWE had fans fill out a survey as they left a pro wrestling event. Picture of cell phones, http://www.w-cellphones.com July 19, 2011. Students measuring heights, http://t3.gstatic.com July 25, 2011. Pop can. http://popartmachine.com July 25, 2011. Calvin and Hobbes. http://www.stat.psu.edu/old_resources/Cartoons/cartoon014.gif July 27, 2011. Clipboard, http://boylston.bbrsd.schoolfusion.us July 25, 2011. Hockey Players. http://images.paraorkut.com July 25, 2011. Scientist:http://www. reversingibs.com July 15, 2011. 153 www.ckl2.org Chapter 5 Analyzing Univariate Data Introduction So now that we have discussed some methods for collecting data we can look at what to do with those findings. Whether you have collected categorical or numerical data you will want to choose an appropriate type of graphical display so that you can see the data. Charts and graphs of various types, when created carefully, can provide important information about a data set. You will also need to analyze the data with numerical and summary statistics. Once you have constructed a graphical display and have calculated numerical statistics, it will be necessary to describe your findings verbally. Statisticians, such as yourself, then make appropriate conclusions and comparisons based on the data and statistics, avoiding opinions and judgment statements. This chapter will concentrate on some of the more common visual presentations of data, numerical analysis of data, and verbal descriptions of data. 5.1 Categorical Data Organize categorical data in tables Construct bar graphs and pie charts by hand and with computer software programs Describe, summarize and compare categorical data www.ckl2.org 154 Each student in the class should complete the following survey. The data collected will be used in your homework problems. Notice that the variables in each question are categorical. 1. What is your gender? choose one o Female o Male 2. Wh at is you r favorite season ? ch o a se a n e a Winter o Spring a Summer a Fail 3. Which of these is your favorite type of food? choose one a Italian o Asian o Mexican a American 4. What type Ofpet(s) do yOU have? Choose all that apply q Dog a Cat o Fish a Reptile o Rodent a Other a None 155 www.ckl2.org Frequency Tables and Bar Graphs (Note- Two videos for this section) ***** ***** t3 £$ When analyzing categorical data (also called qualitative data), bar graphs are commonly used. A bar graph is a graph in which each bar shows how frequently a given category occurs. It is usually helpful to organize the data in a frequency table, a table that shows the number of occurrences for each category, before constructing the bar graph. The bars can go either horizontally or vertically, they should be of consistent width, and need to be equally spaced apart. The categories are separate and can be put in any order along the axis. It is common to put them in alphabetical order, but not needed. And, as with all of the graphs you will construct, be sure to use a consistent scale, include a title, labels for axes, numbers to mark axes as necessary, and a key whenever needed. Example 1 A bar graph could show the types of pets of a group of students for example. Here are the types of pets owned by a class of 33 geometry students. a) Why do the numbers add up to more than 33? b) Construct a bar graph to show this class' data. c) Describe what the graph shows. Type of Pet #of Students Dog 14 Cat 8 Fish 3 Reptile 2 Rodent 5 Other 2 None 7 Solution a) They add up to more than 33 because some students own more than one type of pet and are being counted in more than one category. b) Here is a bar graph that was created using Excel: www.ckl2.org 156 Class Pets None Other J] Rodent o Reptile a t Fsh Cat Dog 5 10 Number of Students 15 c) For this class, the most common pet is a dog. Fourteen students, or J^2% of the class, own a dog. Having a cat, or no pet at all are the next most common events. Five students own some type of rodent, two have reptiles for pets, and three have fish. There are also two students who own some other type of pet. Example 2 A great deal of electronic equipment ends up in landfills as people update their computers, TVs, cell phones, etc. This is a concern because the chemicals from batteries and other electronics add toxins to the environment. This Electronic Waste has been studied in an effort to decrease the amount of pollution and hazardous waste. The following frequency table shows the amount of tonnage of the most common types of electronic equipment discarded in the United States in 2005. Construct a bar graph and comment on what it shows. Table 5.1: Electronic Equipment Thousands of Tons Discarded Cathode Ray Tube (CRT) TV's CRT Monitors Printers, Keyboards, Mice Desktop Computers Laptop Computers Projection TV's Cell Phones LCD Monitors 7591.1 389.8 324.9 259.5 30.8 132.8 11.7 4.9 Electronics Discarded in the US (2005). Source: National Geographic, January 2008. Volume 213 No.l, pg 73. 157 www.ckl2.org Solution The type of electronic equipment is a categorical variable, and therefore, this data can easily be represented using the bar graph below: -a a "E .2 a (a E .o ■o 8000 7000 5000 5000 4000 3000 2000 1000 Electronic Waste «!* C» fc*» & o s A^ O Jr A & f &> ^ & jF 0* ? / «* A' ■v ■AT According to this 2005 data, the most commonly disposed of electronic equipment was CRT TV's, by more than 19 times that of the next type of electronic equipment. Pie Charts ***** £3 Pie charts (or circle graphs) are used extensively in statistics. These graphs are used to display categorical data and appear often in newspapers and magazines. A pie chart shows each category (sectors) as a part of the whole (circle). The relationships between the parts, and to the whole, are visible in a pie chart, by comparing the sizes of the sectors (slices). Constructing a pie chart uses the fact that the whole of anything is equal to 100%-all of the sectors equal the whole circle. Remember from geometry that the central angles of a circle total 360°. So, in regard to pie charts, 360° = 100% of the circle . The sections should have different colors or patters to enable an observer to clearly see the difference in size of each section. Pie charts are the appropriate choice when you are working with categorical data that covers 100%. It is not an appropriate choice when you aren't working with 100% or when choices may include overlaps. For www.ckl2.org 158 example, when we asked every student in this class to list the pets they currently have, we found some students who have more than one pet. So a pie chart would not be an appropriate way to display that data. The sectors in a circle graph do not allow for overlaps such as this. Another time when pie charts are not appropriate is when the choices do not cover all possibilities. For example, the electronic waste example above does not include every possibility, so the categories would not add to 100%. In such cases a bar graph would be a more appropriate choice, because it allows for overlaps and does not need to cover exactly 100% of the choices. Example 3: How to Construct a Pie Chart The Red Cross Blood Donor Clinic had a very successful morning collecting blood donations. Within three hours twenty-five people had made donations. The types of blood dontated are: Table 5.2: Blood Type A B O AB Number of donors 7 5 9 4 Construct a pie chart to represent the data. Solution Step 1: Determine the total number of donors. 7 + 5 + 9 + 4 = 25 Step 2: Express each donor number as a percent of the whole by using the formula where / is the frequency and n is the total number. / Percent = - ■ 100% n ^r-100% = 28% A. 100% = 20% ^ • 100% = 36% ^ • 100% = 16% ZtO Zo Zo ZiO Step 3: Express each donor number as the number of degrees of a circle that it represents by using the formula / Degree = — ■ 360 n where / is the frequency and n is the total number. — • 360° = 100.8° — • 360° = 72° — • 360° = 129.6° — • 360° = 57.6° 25 25 25 25 Step 4: Using a protractor or technology to make the central angles, graph each section of the circle. Step 5: Write the label and correct percentage inside the section. Color each section a different color. Be sure to include a title, and a key if needed. 159 www.ckl2.org Blood Types of 25 Donors From the graph, you can see that more donations were of Type O than any other type. The least amount of blood collected was of Type AB. In order to create a pie graph by using the circle, it is necessary to use the percent of a section to compute the correct degree measure for the central angle. The blood type graph labels each section with context and percent, and not the degrees. This is because degrees would not be meaningful to an observer trying to interpret the graph. If the sections are not labeled directly as they are in that example, it is necessary to include a key so that the observers will know what each section represents. Graphs on Computer Software The above pie chart could be created by using a protractor and graphing each section of the circle according to the number of degrees needed for each section. However, bar graphs and pie charts are most frequently made with computer software programs such as Excel or Google Docs, if you would like to learn how to do this on Excel, click here. You will be asked to create bar graphs and pie charts using computer software. When you do this, be sure to include titles, labels, and keys when needed. Be sure to 'fix' the graph generated by the software program so that it looks the way you want it to look and shows clearly what ever it is you are trying to convey. Example 4 Comment on what the graph shows: www.ckl2.org 160 People who like different fruits 10% 35% 20% 25% □ Apples □ Cherries □ Grapes □ Others □ Bananas □ Dates Solution Several people were asked to choose their favorite fruits from a list of six options. Apples were the favorite choice with 35% of the participants choosing them. The second favorite fruit was cherries at 25%, followed by grapes with 20%. Ten percent of the people said that dates were their favorite fruit. However, only 7% chose bananas from the choices provided and the remaining 3% liked some fruit other than those listed. Pictographs Another type of graph that is sometimes used to display categorical data is a pictograph. A pictograph is basically a bar graph with pictures instead of bars. A problem with pictures in graphs is that the area that they take up can mislead the observer. The width and height both increase as the picture gets larger. Pictographs are often used, but should generally be avoided in serious statistical representations. Example 5 The following graph compares the number of wins for high school football teams during the 2010 seasons. Explain why the pictograph is misleading. 161 www.ckl2.org Football Team Wins by School 2010 Season Kennedy Washington Solution The pictures increased in both height and width. So when something should be doubled, it actually looks four times as big. For example, when comparing the number of wins between Eisenhower and Adams the graph should show 4 times as many wins. However, in this pictograph it looks as though Adams had 16 times as many wins (4 times as wide X 4 times as tall). www.ckl2.org 162 Section 5.1 Exercises 1) Computer equipment contains many elements and chemicals that are either hazardous, or potentially valuable when recycled. The following data set shows the contents of a typical desktop computer weighing approximately 27 kg. Some of the more hazardous substances, like Mercury, have been included in the "other" category, because they occur in relatively small amounts that are still dangerous and toxic. Table 5.3: Material Kilograms Plastics 6.21 Lead 1.71 Aluminum 3.83 Iron 5.54 Copper 2.12 Tin 0.27 Zinc 0.60 Nickel 0.23 Barium 0.05 Other elements and chemicals 6.44 Figure: Weight of materials that make up the total weight of a typical desktop computer. Source: http://dste.puducherry.gov. a) Create a bar graph for this data. b) Complete the chart below to show the approximate percentage of the total weight for each material and the central angle needed. c) Why is a pie chart also appropriate for this example? d) Create a pie chart for this data. Table 5.4: Material Kilograms Approximate % Central Angle of Total Weight for Pie Graph Plastics 6.21 Lead 1.71 Aluminum 3.83 Iron 5.54 Copper 2.12 Tin 0.27 Zinc 0.60 Nickel 0.23 Barium 0.05 Other elements 6.44 163 www.ckl2.org 2) Based on what you can see in the graph, write a brief description of what it is showing. This should be at least three sentences and in context. 900000 800000 700000 600000 500000 400000 300000 200000 1 00000 North American Bear Population, 2007 Black Bear Brown Bear Polar Bear Source. http://www.niathworksheetscenter.com Aug. 5, 2011. 3) Type of Pet? a) Construct a frequency table to show the Type of Pet data from our class. b) Use Excel or Google Docs to create a bar graph that shows the types of pets the students in our class have. c) Write a brief description of what your graph shows. 4) Favorite Season? a) Construct a frequency table to show the Favorite Season data from our class. b) Use Excel or Google Docs to create a pie chart that shows the favorite season of the year for the students in our class. c) Write a brief description of what your graph shows. www.ckl2.org 164 5) Look at the school lunch graph that was created by some students: Our Favorite School Lunch n felted .U-iiL'.Ui-VM £*&+ £\fe, £**? ^gSk 4dk £& J^i* ( c&j r&r imi ^mi Each Pictur«= 2 Students a) In what way is this graphical representation misleading? Explain. b) Create a better graphical representation for this same data. 6) Favorite Foods? a) Construct a frequency table to show the Favorite Food data separately for males and females for our class. b) Use Excel or Google Docs to create two pie charts that compare the favorite food types for the boys and girls in our class. The charts should 'match' as much as possible- they should be the same size and use the same colors, fonts, etc. c) Write a brief description comparing the boys and girls choices for favorite food. Look for similarities and differences. 165 www.ckl2.org 7) The following table has Minnesota Wild statistics for 2010-2011, for some of the Wild players. Thirteen variables are listed across the top and have been highlighted. a) Identify the individuals. b) Identify what each variable is (Example GP = games played). You may need to do some research. c) Classify each variable as numerical or categorical? 2010-2011 REGULAR SEASON Forwards & Defensemen # POS PLAYER GP G A P +/- PIM PP SH GW S S% I I 24 R MARTIN HAVLAT 78 22 40 62 -10 52 3 4 229 9.6 15 8 7 96 21 20 R 22 R MIKKO KOIVU ANDREW BRUNETTE I BRENT BURNS MATT CULLEN PIERRE-MARC BOUCHARD KYLE BRODZIAK ANTTI MIETTINEN CAL CLUTTERBUCK 71 17 45 62 4 50 7 82 18 28 46 -7 16 8 80 17 29 46 -10 98 8 78 12 27 39 -14 34 5 59 12 26 38 -3 14 80 16 21 73 16 19 37 35 56 33 76 19 15 34 -5 79 3 3 191 8.9 3 117 15.4 3 170 10.0 2 1 50 8.0 98 126 168 12.2 12.7 9.5 191 9.9 www.ckl2.org 166 5.2 Time Plots & Measures of Central Tendency Learning Objectives • Construct time plots • Describe trends in time plots • Calculate range and measures of central tendency: mean, median, mode • Understand how a change in the data will effect the statistics Line Graphs as Time Plots We are often interested in how something has changed over time. The type of graphical display that shows this the most clearly is a time plot, or line graph. When one of the variables is time, it will almost always be plotted along the horizontal axis (as the explanatory variable). Because time is a continuous variable and we are trying to see if there is some type of trend in how the other variable (response) has behaved over a period of time, a line graph is often very useful in showing this relationship. Example 1 The total municipal waste generated in the US by year is shown in the following data set. a) Construct a time plot to show the change in the amount of municipal waste generated in the United States during the 1990's. b) Comment on the trend that is shown in the graph. c) Suggest factors (other than time) that may be leading to this trend. Table 5.5: Year Municipal Waste Generated (Millions of Tons) 1990 269 1991 294 1992 281 1993 292 1994 307 1995 323 1996 327 1997 327 1998 340 167 www.ckl2.org Source: http://www.zerowasteanierica.org www.ckl2.org 168 Solution a) In this example, the time (in years) is considered the explanatory variable, and is graphed along the horizontal axis. The amount of municipal waste is the response variable, and is graphed along the vertical axis. Time plots can be drawn by hand, graph paper makes this easier, or created with computer software programs, or graphing calculators. This example was made using Excel. Municipal Waste Generated (millions of tons) 350 3O0 250 200 150 100 50 o 1983 199C 1992 .99- 1996 1993 "i 2CC0 b) This graph shows that the amount of municipal waste generated in the United States increased at a fairly steady rate during the 1990s. Between 1991 and 1992 there was a decrease of 13 million tons of municipal waste, but every other year during the 1990s had an increase. c) It should be noted that factors other than the passage of time cause our waste to increase. Other factors, such as population growth, economic conditions, and societal habits and attitudes also contribute as causes. Example 2 Here is a line graph that shows how the hourly minimum wage changed from when it was first mandated through 1999. a) During which decade did the hourly wage increase by the greatest amount? b) During which decade did it increase the most times? c) When did it stay constant for the longest? 169 www.ckl2.org The Federal Hourly Minimum Wage Since Its Inception 6-1 1-6 5- r 2- 1- -5 -4 -3 -2 -1 « 9 cn en 9 Cn 9 M Year Source. http://mste.illinois.edu Aug 1,2011. Solution a) The greatest increase appears to have happened during the 1990s, when it went from ~$3.75 to ~$5.20. b) The 1970s appear to have had 5 or 6 increases in the minimum wage. c) The longest constant minimum wage was during the 1980s. Measures of Central Tendency & Spread Mean The mean, often called the 'average' of a numerical set of data, is the sum of all of the numbers divided by the number of values in the data set. This value is the arithmetic mean, and it tells us what value we would have if all of the data were the same. The mean is the balance point of a distribution, and is one of the three measure of central tendency commonly used in statistics. The mean, the median, and the mode www.ckl2.org 170 are all measures of central tendency. They all show where the center of a set of data "tends" to be. Each one is useful at different times. The mean is a summary statistic that gives you a description of the entire data set and is especially useful with large data sets where you might not have the time to examine every single value. However, the mean is affected by extreme values, called outliers, and can end up leaving the observer with the wrong impression of a data set. Imagine if we were to report the average salary of all employees of Burger Boy, including the main manager, the mean salary would give the impression that pay at Burger Boy is pretty high. But, if we removed the manager's salary from the calculation we would see a much different 'average' salary. Median The median is the number in the middle position once the data has been organized. Organized data is simply the numbers arranged from smallest to largest or from largest to smallest. This is the only number for which there are as many above it as below it in the set of organized data, and is referred to as the equal areas point. The median, for an odd number of data, is the value that is exactly in the middle of the ordered list, it divides the data into two halves. The median for an even number of data, is the mean of the two values in the middle of the ordered list. The median is useful when there are a few extreme values that can effect the mean, because the middle number will stay in the middle. The median often gives a good impression of the center, because there are 50% of the values above the median, 50% of the values below the median, and it doesn't matter how big the biggest values are or how small the smallest values are. Mode The mode of a set of data is simply the number that appears most frequently in the set. There are no calculations required to find the mode of a data set. You simply need to look for it. However, be aware that it is common for a set of data to have no mode, one mode, two modes or more than two modes. If there is more than one mode, simply list them all. And, if there is no mode, write 'no mode'. No matter how many modes, the same set of data will have only one mean and only one median. The mode is a measure of central tendency that is simple to locate but is not used much in practical applications. It is the only one of these three values that can be for either categorical or numerical data. Remember the example regarding pets? The mode was 'dogs' because that was the most common response. Range The range of a data set describes how spread out the data is. To calculate the range, subtract the smallest value from the largest value (maximum value - minimum value = range). This value provides information about a data set that we cannot see from only the mean, median, or mode. For example, two students may both have a quiz average of 75%, but one of them may have scores ranging from 70% to 82% while the other may have scores ranging from 24% to 90%. In a case such as this, the mean would make the students appear to be achieving at the same level, when in reality one of them is much more consistent than the other. #*" 171 www.ckl2.org Example 3 Stephen has been working at Wendy's for 15 months. The following numbers are the number of hours that Stephen worked at Wendy's during the past seven months: 24,24,31,50,53,66,78 What is the mean number of hours that Stephen worked per month? Solution Stephen has worked at Wendy's for 15 months but the numbers given above are for seven months. Therefore, this set of data represents a sample of the population. The formula that is used to calculate the mean for a sample and for a population is the same. However, the symbols are different. The mean of a sample is denoted by x which is called "x bar". The mean of an entire population is denoted by // which is the Greek letter "mu" (pronounced "myoo"). The number of data for a sample is written as n. The following formula represents the steps that are involved in calculating the mean of a sample: add the numbers Mean the number of numbers This formula can now be written using symbols. X\ + x 2 + x 3 + . . . + x„ X = You can now use the formula to calculate the mean of the hours that Stephen worked. Xl + X 2 + X 3 + . . . + X n X n _ 24 + 25 + 33 + 50 + 53 + 66 + 78 X= 7 __ 329 x ~ ~Y x = 47 The mean number of hours that Stephen worked during this time period was Jf.1 hours per month. www.ckl2.org 172 Example 4 The ages of several randomly selected customers at a coffee shop were recorded. Calculate the mean, median, mode, and range for this data. 23, 21, 29, 24, 31, 21, 27, 23, 24, 32, 33, 19 Solution mean: (23 + 21 + 29 + 24 + 31 + 21 + 27 + 23 + 24 + 32 + 33 + 19) /12 = 307/12 307/12 = 25.58 median: first, organize the ages in ascending order 19,21,21,23,23,24,24,27,29,31,32,33 second, count in to find the middle value 24, 24 the middle value will be halfway between these 2 two values (or the average of these two values) — i — = 24 mode: look for the value(s) that occur most frequently 21,23,24 this data set has three modes range: subtract the smallest value from the largest value (max - min = range) 33 - 19 = 14 Solution: make your conclusion in context At this coffee shop, the mean age of people in this sample is 25.58 years old and the median age is 24 years old. There were three modes for age at 21, 23, and 24 yeas old and the range for ages is 14 years. f* Example 5 Lulu is obsessing over her grade in health class. She just simply cannot get anything lower than an A-,or she will cry! She knows that the grade will be based on her average (mean) test grade and that there will be a total of six tests. They have taken five so far, and she has received 85%, 95%, 77%, 89%, and 94% on those five tests. The third test did not go well, and she is getting worried. The cutoff score for an A is 93%, and 90% is the cuttoff score for an A-. She wants to know what she has to get on the last test. The teacher assures her that she will round to the nearest whole percent. a) What is the lowest grade Lulu will need to get on the last test in order to get an A in health? b) What is the lowest grade Lulu will need to get on the last test in order to get an A- in health? 173 www.ckl2.org Solution a) So she sets up an equation thinking about how she would calculate her average test grade if she knew all six scores. Knowing that she wants the final average to equal 93%, she puts an 'a;' in the place of the last test score, and then does some algebra to solve for x. 85+95+77+89+94+* _ gg (85 + 95 + 77 + 89 + 94 + x) = 93 * 6 85 + 95 + 77 + 89 + 94 + x = 558 440 + x = 558 x= 118 Oh no! There is no way she can get 118%. So, there is no possible hope for her to get an A. b) It is time to try for an A-, but that 118% scared her, so she is going to think of the lowest possible score that will still be an A-. With rounding, if she can get her mean score to 89.5%, she will make it. So she tries the same algebra, but with 89.5 as the final result. 85+95+77+89+94+x = gg 5 (440 + x) = 89.5 * 6 440 + x = 537 x = 97 There is hope! As long as she gets a 97% or higher on this last test, she can get an A-. She is going to study like crazy! Section 5.2 Exercises 1) Determine the mean, median, mode and range for each of the following sets of numbers: a) 20, 14, 54, 16, 38, 64 b) 22, 51, 64, 76, 29, 22, 48 www.ckl2.org 174 c) 40, 61, 95, 79, 9, 50, 80, 63, 109, 42 2) The mean weight of five men is 167.2 pounds. The weights of four of the men are 158.4 pounds, 162.8 pounds, 165 pounds and 178.2 pounds. What is the weight of the fifth man? 3) The mean height of 12 boys is 5.1 feet. The mean height of 8 girls is 4.8 feet. a) What is the total height of the boys? b) What is the total height of the girls? c) What is the mean height of the 20 boys and girls all together? 4) The following data represents the number of advertisements received by ten families during the past month. Make a statement describing the 'typical' number of advertisements received by each family during the month. Be sure to include statistics to support your statement. 43 37 35 30 41 23 33 31 16 21 5) Mica's chemistry teacher bases grades on the average of each student's test scores during the trimester. Mica has been kind of slacking this year, but hasn't been too concerned because he knows that he will at least get the credit (60% = passing). However, his parents just informed him that he will not be allowed to use the car if he has any grade below a C (73%). Here are Mica's chemistry test scores for the first eight chapters: 10,70,71,82,65,76,58,75 a) Calculate the mean, median, mode, and range for Mica's chemistry tests. What grade will Mica receive in chemistry based on this? b) His teacher has decided that each student may retake any one of his or her tests in an effort to improve his or her grade. Mica jumps at this opportunity, studies chapter one for hours and retakes the test. To his, and his mother's delight, his 10% turns into a 70%!! Woo-hoo! Calculate the mean, median, mode, and range for Mica after this change. Which of these values changed? Which did not? What grade will Mica receive now? c) If Mica continues to study and earns an 60% on the chapter 9 test and a 76% on the chapter 10 test, what will his final average be? d) If Mica continues to study and earns an 85% on the chapter 9 test and a 90% on the chapter 10 test, what will his final average be? 175 www.ckl2.org 6) Deals on Wheels: The following table lists the retail price and the dealer's costs for 10 cars at a local car lot this past year: Table 5.6: Car Model Retail Price Dealer's Cost Amount of Mark-Up Percent of Mark-Up Nissan Sentra $24,500 $18,750 Ford Fusion $26,450 $21,300 Hyundai Elantra $22,660 $19,900 Chevrolet Malibu $25,200 $22,100 Pontiac Sunfire $16,725 $14,225 Mazda 5 $27,600 $22,150 Toyota Corolla $14,280 $13,000 Honda Accord $28,500 $25,370 Volkswagen Jetta $29,700 $27,350 Subaru Outback $32,450 $28,775 a) Calculate the amount each car was marked up. b) Calculate the percent that each car was marked up J^aier-'Jost * 100% rounded to the nearest one-tenth of a percent. Report answers c) Calculate the mean, median, mode and range for the percent of mark-up. d) Do the "amount of mark-up column" and the "percent of mark-up column" put the cars in the same order for profit? Explain or give an example. www.ckl2.org 176 7) Write a brief description of what the line graph for Platinum Prices shows. Be sure that you do this in context, as complete sentences, and that you include at least three observations. Line Graph: Platinum Prices, 1960 to 2005 The line graph shows the price of platinum per ounce in US dollars between 1960 and 2005 1000 900 of 300 Q. s 700 600 500 400 300 200 1 00 Platinum Price (US Dollars) Per Ounce liur - Vr*» *~W 1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 Source. http://www.admc.hct.ac.ae 8) According to the U.S. Census Bureau, "household median income" is defined as "the amount which divides the income distribution into two equal groups, half having income above that amount, and half having income below that amount." The table shows the median household income data, every 3 years, from 1975 until 2008, according to the U.S. Census Bureau. 1975 1978 1981 1984 1987 1990 1993 1996 1999 2002 2005 2008 $11,800 $15,064 $19,074 $22,415 $26,061 $29,943 $31,241 $35,492 $40,696 $42,409 $46,326 $50,303 a) Construct a time plot for the median household data. You may do this by hand on graph paper, or by using technology. b) Write a brief description of what the line plot shows. This should be done as complete sentences, in context of the distribution, and should include at least three distinct observations. g" 177 www.ckl2.org 5.3 Numerical Data: Dot Plots & Stem Plots Learning Objectives • Construct dot plots, stem plots and split- stem plots • Calculate numerical statistics for quantitative data • Identify potential outliers in a distribution • Describe distributions in context- including shape, outliers, center, and spread Dot Plots One convenient way to organize numerical data is a dot plot. A dot plot is a simple display that places a dot (or X, or another symbol) above an axis for each datum value (datum is the singular of data). The axis should cover the entire range of the data, even numbers that will have no data marked above them should be included to show outliers or gaps. There is a dot for each value, so values that occur more than once will be shown by stacked dots. Dot plots are especially useful when you are working with a small set of data across a reasonably small range of values. This type of graph gives a clear view of the shape, any mode(s) and the range of a set of data. The numbers are already in order, so finding the median is fairly quick. And any outliers are quickly visible. Ages of all of the Sales People at Stinky's Car Dealership. * -i — I I I— I— I I I I I I I I I I I l_ 45 47 46 51 65 55 57 59 51 53 55 57 69 Describing a Numerical Distribution Shape Once a graphical display is constructed, we can describe the distribution. When describing the distribution, we should be sure to address its shape. Although many graphs will not have a clear or exact shape, we can usually identify the shape as symmetrical or skewed. A symmetrical distribution will have a middle where we can draw an imaginary line through the center, and a fairly equal "look" on either side of that imaginary line. If you were to fold along the imaginary center line, the two sides would almost match up. Many symmetrical distributions are bell shaped, they will be tall in the middle with the two sides thinning www.ckl2.org 178 out. The sides are referred to as tails. A skewed distribution is one in which the bulk of the data is concentrated on one end, with the other side being a longer tail. The direction of the longer tail is the direction of the skew. Skewed right will have a longer tail to the right, or higher numbers. Skewed left will have a longer tail off to the left, or the lower values. Other shapes that you might see are uniform (almost consistent height all the way across) and bimodal (having two peaks in the distribution). \L- Symmetric Bell shaped Skewed to the LeA Skewed to the Right Outliers If there are any outliers, gaps, groupings, or other unusual features in the distribution, we should be sure to mention them. An outlier is a value that does not fit with the rest of the data. Some distributions will have several outliers, while others will not have any. We should always look for outliers because they can affect many of our statistics. Also, sometimes an outlier is actually an error that needs to be corrected. If you have ever 'bombed' one test in a class, you probably discovered that it had a big impact on your overall average in that class. This is because the mean will be affected by an outlier-it will be pulled toward it. This is another reason why we should be sure to look at the data, not just look at the statistics about the data. When an outlier is part of the data and we do not realize it, we can be misled by the mean to believe that the numbers are higher or lower than they really are. Context Do not forget that the graph, the numbers and the descriptions are all about something-its context. All of these elements of the distribution should be described in the context of the situation in question. Center The center of the distribution should always be included in the verbal analysis as well. People often wonder what the 'average is'. The measure for center can be reported as the median, the mean, or the mode. Even better, give more than one of these in your description. Remember that outliers effect the mean, but 179 www.ckl2.org do not effect the median. For example, the median of a list of data will stay in the center even when the largest value increases tremendously, but such a change would effect the mean quite a bit. Spread Another thing to include in the description is the spread of the data. The spread is the specific range of the data. When analyzing a distribution, we don't want to simply say that the range is equal to some number. It is much more informative to say that the data ranges from to (minimum value to maximum value). For example, if the news reports that the temperature in St. Paul had a range of 20° during a given week, this could mean very different temperatures depending on the time of year. It would be more informative to say something specific like, the temperature in St. Paul ranged from 68° to 88° last week. So, when you describe the distribution of a numerical variable, there are several things to include. This text will use the acronym S.O.C.C.S! (shape, outliers, center, context, spread) to help us remember what characteristics to include in our descriptions. Example 1 An anthropology instructor at the community college is interested in analyzing the age distribution of her students. The students in her Anthropology 102 class are: 21, 23, 25, 26, 25, 24, 26, 19, 18, 19, 26, 28, 24, 22, 24, 19, 23, 24, 24, 21, 23, and 28 years old. Organize the data in a dot plot. Calculate the mean, median, mode, and range for the distribution. Describe the distribution. Be sure to include the shape, outliers, center, context, and spread. a) construct a dot plot X X X X X X X X X X X X X X X X X X X X X X 18 19 20 21 22 23 24 25 26 27 28 Ages of Students in Anthropology 102 b) mean-(18+19+19+19+21+21+22+23+23+23+24+24+24+24+24+25+25+26+26+26+28+28)/22 = 23.2727... mean = x = 23.27 years old median- already listed in order, count to find "middle number", it is between 24 and 24, find mean of these two numbers (24+24) /2=24 median = Med = 24 years old mode- look for most frequent age, it is 24 mode = 24 years old www.ckl2.org 180 range- min age is 18, max age is 28 range is 28 - 18 years 10 years or ages range from 18 to 28 c) describe- address the shape, outliers, center, context, and spread of the distribution (This could be described as fairly symmetrical or slightly skewed to the left) The distribution of student ages in this Anthropology 102 class is fairly symmetrical with no clear ouliers. The ages of students range from 18 to 28 years old. The median and mode for age are both 24 years old and the mean is 23.27 years. Thus, the typical student in this class is 23-24 years of age. Stem Plots In statistics, data is represented in tables, charts or graphs. One disadvantage of representing data in these ways is that the specific data values are often not retained. Using a stem plot is one way to ensure that the data values are kept intact. A stem plot is a method of organizing the data that includes sorting the data and graphing it at the same time. This type of graph uses the stem as the leading part of the data value and the leaf as the remaining part of the value. The result is a graph that displays the sorted data in groups or classes. A stem plot is used with numerical data when it will be helpful to see the actual values organized in order. To construct a stem plot you must first determine the range of your distribution. Build the stems so that they cover the entire range, include every stem even if it will have no values after it. This will allow us to see the true shape of the distribution including outliers, whether it is skewed, and any gaps. Then place all of the "leaves" after the appropriate stems. Place the numbers in ascending order out and include all values, so repeats will show more than once. Some people like to put the numbers in order before they construct the stem plot, some like to try to put them in order as they make the plot, and others like to make a rough draft first without regard to order and then to make a final copy with the numbers in the correct order. Any of these methods will result in a correct stem plot. Stems Leaves 2 6 1 3 C5 ^\5)8 3 4 4 9 4 (5 |9 6 1 5 7 Means 25 Means 55 Example 2 A researcher was studying the growth of a certain plant. She planted 25 seeds and kept watering, sunlight, and temperature as consistent as possible. The following numbers represent the growth (in centimeters) 181 www.ckl2.org of the plants after 28 days. a) Construct a stem plot b) Describe the distribution. Solution 18 10 37 36 61 39 41 49 50 52 57 53 51 57 39 48 56 33 36 19 30 41 51 38 60 a) Construct a stem plot-the stem plot on the left was the first draft, the one on the right has the numbers in the correct order (ascending as you go out) key 4/9 = 49cm Plant height in CM Plant height in CM Stem Leaf 1 8,0,9 2 3 7, 6, 9, 9, 3, 6, 0, 8 4 1,9,8,1 5 0,2,7,3,1,7,6,1 6 1,0 Stem Leaf 1 0,8,9 2 3 0, 3, 6, 6, 7, 8, 9, 9 4 1,1,8,9 5 0,1,1,2,3,6,7,7 6 0,1 b) Describe the distribution- Be sure to address shape, outliers, center, context, & spread. The distribution of growth at 28 days ranged from 10 to 61 centimeters for these plants with the majority of plants growing at least 30cm. The median height was ^icra after 28 days. The shape is bimodal and there is a gap in the distribution because there are no plants in the 20-29 cm class. There are some possible low outliers, but no high outliers for plant growth. www.ckl2.org 182 Example 3 Sometimes a stem plot ends up looking too crowded. When the data is concentrated in a few rows, or 'classes', it can be difficult to determine what the shape is or whether there are any outliers in the data. In this example, the stem plot for the ages of a group of people was really concentrated in the 30s and 40s (plot on left). However, the statistician looking at this was not satisfied with the crowded appearance, so she decided to 'split' the stems. The resulting graph on the right, called a split-stem plot, shows very different results. Describe the distribution based on the split-stem plot. STEM PLOT FOR AGES 2 3 4 5 2 334556777899 0144577778889 12 2 4 SPLIT-STEM PLOT FOR AGES 2 2 2 3 3 3 4 3 556777899 4 14 4 4 577778889 5 12 2 4 Solution To split the stems, each stem was written twice. The top one is for the first half of the leaves in that class, and the second one is for the leaves in the second half of that class. For example the first stem of 4 gets 40 to 44, and the second 4 gets 45 to 49. So, when splitting stems into two, the number 5 is the cutoff for moving into the second part (just like rounding). The split-stem plot shows that the distribution of ages in this example is bimodal and skewed to the left (lower numbers). It also shows that the ages of 20 and 22 appear to be low outliers. None of this was visible in the regular stem plot. Both plots show that the ages range from 20 to 54 years, with a median age of 41 years old and a mode age of 47 years old. 183 www.ckl2.org Section 5.3 Exercises 1) The following is data representing the percentage of paper packaging manufactured from recycled ma- terials for a select group of countries. Table 5.7: Percentage of the paper packaging used in a country that is recycled. Source: National Geographic, January 2008. Volume 213 No.l, pg 86-87. Country % of Paper Packaging Recycled Estonia 34 New Zealand 40 Poland 40 Cyprus 42 Portugal 56 United States 59 Italy 62 Spain 63 Australia 66 Greece 70 Finland 70 Ireland 70 Netherlands 70 Sweden 76 France 76 Germany 83 Austria 83 Belgium 83 Japan 98 The dot plot for this data would look like this: O O o • o o o O OO O O OQ o o o o o - 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 — 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 percentage a) Calculate the mean, median, mode, and range for this set of data b) Describe the distribution in context. Remember your S.O.C.C.S! www.ckl2.org 184 2) At the local veterinarian school, the number of animals treated each day over a period of 20 days was recorded. 28 34 23 35 16 17 47 05 60 26 39 35 47 35 38 35 55 47 54 48 a) Construct a stem plot for the data b) Describe the distribution thoroughly. Remember your S.O.C.C.S! 3) The following table gives the percentages of municipal waste recycled by state in the United States, including the District of Columbia, in 1998. Data was not available for Idaho or Texas. State Percentage State Percentage State Percentage AL 23 LA 14 OH 19 AK 7 ME 41 OK 12 AZ 18 MD 29 OR 28 AR 36 MA 33 PA 26 CA 30 MI 25 RI 23 CO 18 MN 42 SC 34 CT 23 MS 13 SD 42 DE 31 MO 33 TN 40 DC 8 MT 5 TX NA FL 40 NE 27 UT 19 GA 33 NV 15 VT 30 HI 25 NH 25 VA 35 ID NA NJ 45 WA 48 IL 28 NM 12 WV 20 IN 23 NY 39 WI 36 IA 32 NC 26 WY 5 KS 11 ND 21 KY 28 Source: http://www.zerowasteamerica.org 185 www.ckl2.org a) Create a split-stem plot for the data (there are 16 numbers in the 20-29 range) b) Find the median percentage for this data. c) Which of the two plots you made gives a better view of the data? Explain. d) Describe the distribution thoroughly. Remember your S.O.C.C.S! Specifically identify any states that stand out. 4) This stem plot is one that looks too crowded. Stem Leaves 6 4 7 8 9 7 02222233 34 44556667 8 112 2 2 represents 64 a) Create a split-stem plot for this example. b) Name at least two things that are visible in the second plot that were not apparent in the first plot. c) Invent a scenario that this data could represent. www.ckl2.org 186 5) Several game critics rated the Wow So Fit game, on a scale of 1 to 100 (100 being the highest rating The results are presented in this stem plot: S 2 3 5 7 4 5 Key: 6 | 7 = 67 8 9 2 2 4 6 7 7 9 002567799 1122356679 2 2 2 6 7 a) Find the three measures of central tendency for the game rating data. b) Which of these three measures of central tendency gives the best impression of the 'average' (typical) rating for this game? Explain. 187 www.ckl2.org 6) These dot plots do not have any numbers or context. For each of the following dot plots: a) Identify the important features of the distribution (shape, outliers, center, spread). b) Suggest a possible variable that might have such a distribution. (In other words, invent a context that fits the graph.) i) o oooo oooooo ooo oooo ooo oooooooo OOP oooooooooooooooo 1 1 r ii) o CDO O oo oooooo o O GODOOOOQQOOO O OOG0DOOCQ>CDG0OOO O O m ) o o <D O o OOOOOOO o o o o o oo o o o o oooo o o ooooooooooooooo o T T iv) o o oo o OCDOO O oooooo o o oooo o ooo o T oooo ooo www.ckl2.org 188 7) This table displays statistics for 21 of the Wild players for 2010-2011 regular season games. We are going to analyze the variable 'GP', which stands for games played. Forwards & Defense men # POS 24 1 R 9 15 8 PLAYER MARTIN HAVLAT MIKKO KOIVU D 7 C 96 C 21 C ANDREW BRUNETTE BRENT BURNS MATT CULLEN 20 R 22 R 11 C 55 D D 12 R 23 L 46 D PIERRE-MARC BOUCHARD KYLE BRODZWK ANTTI MIETTINEN CALCLUTTERBUCK JOHN MADDEN 16 D D MAREKZIDLICKY NICK SCHULTZ CHUCK KOBASEW ERIC NYSTROM JARED SPURGEON CLAYTON STONER BRAD STAUBITZ GREG ZANON 19 C 48 L 25 D PATRICK O'SULLIVAN GUILLAUME LATENDRESSE CAM BARKER GP G 78 22 71 17 82 18 80 17 78 12 59 12 80 16 73 16 76 19 76 12 A P +/- 40 62 -10 45 62 4 28 46 -7 29 46 -10 — I — 27 39 -14 26 38 -3 21 37 -4 19 35 -3 PIM PP 52 50 16 98 34 14 56 38 46 74 63 32 53 57 71 82 21 11 52 1 15 34 -5 13 25 -9 17 14 24 17 16 -6 12 -16 79 10 30 38 12 -1 19 30 96 173 48 GW 4 5 -10 34 S 229 191 117 170 150 98 126 168 191 107 53 46 74 83 38 40 29 55 37 18 44 S% 9.6 8.9 15.4 10.0 8.0 12.2 12.7 95 9.9 11.2 13.2 6.5 12.2 4.8 10.5 5.0 13.8 0.0 2.7 16.7 2.3 source: http://wild.nhl.com. July 25, 2011 a) Create a stem plot for the number of games played by these Wild players. b) Calculate the mean, median, mode, range for the number of games played by these Wild players. c) Describe the distribution of the number of games played by these players. Remember your S.O.C.C.S! 8) Now, you will examine the +/- data. a) Find out what +/- stands for? b) Construct a dot plot to show the +/- data, b) Describe the distribution. 189 www.ckl2.org 5.4 Numerical Data: Histograms Learning Objectives • Construct histograms • Describe distributions including shape, outliers, center, context, and spread. Histograms When it is not necessary to show every value the way a stem plot would do, a histogram is a useful graph. Histograms organize numerical data into ranges, but do not show the actual values. The histogram is a summary graph showing how many of the data points falling within various ranges. Even though a histogram looks similar to a bar graph, it is not the same. Histograms are for numerical data and each 'bar' covers a range of values. Each of these 'bars' is called a class or bin. Histograms are a great way to see the shape of a distribution and can be used even when working with a large set of data. The width of the bins is the most important decision when constructing a histogram. The bins need to be of consistent width (i.e. all cover a range of 10, or 25, etc.). It is generally a good idea to try to have 7 to 15 bins. Start with the range and divide by 10. This will give you a rough idea of how wide to make your bins. From there it becomes a judgment call as to what is a reasonable bin width. For example, it really does not make any sense to count by 11.24 just because that is what the range divided by 10 is equal to. In such a case, it might make more sense to count by 10's or 12's depending on the specific data. Example 1 Suppose that the test scores of 27 students were recorded. The scores were: 8, 12,17, 22, 24, 28, 31, 37, 37, 39, 40, 42, 43, 47, 48, 51, 57, 58, 59, 60, 65, 65, 74, 75, 84, 88, 91. The lowest score was an 8 and the highest was a 91. Construct a histogram. Solution Plan bin width: The first step is to look at the range (91 - 8 = 83). Then divide the range by 10 (83/10 = 8.3). It doesn't make any sense to count by bins of 8.3 points, so we may use 8, or 10, or 12. Next we look at where to start. The first number is 8. It doesn't make any sense to start counting at 8 either, or to end at 91. We will probably want to start from and end at 100, counting by 10's should work nicely. * Where to begin, and what to count by are not obvious to a calculator or many computer software programs. The graphing calculator would probably start at 8, and count by 8.3. Leaving you with bins of [8 -16.3); [16.3-24-6); [24-6 -32.9); etc. So, if you are using technology to create a histogram, you will generally need to fix the window so that the bins make sense. www.ckl2.org 190 Mark horizontal axis: Mark your scale along the horizontal axis to cover your entire range and to count by your decided upon bin width. Include numbers. Count number of values within each bin: How many values falls between and <10? One, so we make the bin one unit tall. Between 10 and <20? Two, so we make the bin two units tall, etc. A frequency table may be helpful here. You need to know how tall to make each bin. You especially need to know how tall to make the tallest of the bins. Mark vertical axis: Your vertical axis needs to reach the height of the tallest bin. Mark your vertical axis by consistent steps so that it will reach the number needed. Include numbers. *For instance, if you need to get to 2,460; then you should probably count by steps of 250 's or even a larger number. Make your histogram: Make the bins the correct heights, shade or color them in, add labels including any units, a title, and a key if needed. TEST SCORES 6 — — f - i — i — — — r - r - r 5 4 3 — — r - ■ i — r - i n "" 1 — r - i — r i i i i 1 — r - r - r i i i i 1 — \ - r - i- 2 i i - - r 1 I 40 50 60 70 30 SO 100 Scone rvcLMEiA.L-oiii Test score histogram, http://www.netmba.com The bins in this example are [0 to 10); [10 to 20); etc. This means that zero up to, but not including, 10 are in the first bin (9.999 would be in bin #1, but 10 would be in bin #2). You may be creating your histograms with paper and pencil. However, the graphing calculators are a great way to create histograms as well. It takes a little practice to learn how to adjust the windows, but you have the opportunity to try out different bin widths without needing to erase or start all over. Also, you may want to see how to create histograms in excel. When you use a graphing calculator to create your graphs, you should sketch what the calculator shows you. Your sketch should look similar to the graphing window shown, and will still need labels and titles. #*" 191 www.ckl2.org Example 2 a) Construct a histogram to look at the distribution of acceptance rates for these U.S. Universities. b) Describe your findings. The following table gives a list of the acceptance rate for applicants to twelve U.S. universities. (Source: Time Almanac 2004) College or Univeristy Percent Accepted Harvard University 11 Yale University 16 Princeton University 12 Johns Hopkins University 32 New York University 29 M. I. T. 16 Duke University 26 Carnegie Mellon University 36 George Washington University 49 Northwestern University 33 American University 72 Cornell University 31 http://jcsites.juniata.edu Solution a) Try this on your calculator: Enter the data in a list and set up a histogram. Plan bin width: Determine the range (72 -11= 61). Divide by 10 (61/10 = 6.1) to get a rough idea of a good bin width. We can use a variety of bin width of 5, 7.5, 8, or 10, etc. We must start before the minimum of 11 (start at or 10), and pass the maximum of 72 (80). After trying a few of these bins, we decide to use bins of 10, starting at 10 and ending at 80. Here is the window that was used: {x-min =10, x-max=80, x-scl=10, y-min=-2, y-max=5, y-scl=l} Mark horizontal axis: Mark your scale along the horizontal axis to cover your entire range and to count by your decided upon bin width. Include numbers. Count number of values within each bin: A frequency table may be helpful here. You need to know how tall to make each bin. You especially need to know how tall to make the tallest of the bins. Mark vertical axis: Your vertical axis needs to reach the height of the tallest bin. Mark your vertical axis by consistent steps so that it will reach the number needed. Include numbers. Make your histogram: Make the bins the correct heights, shade or color them in, add labels including and units, a title, and a key if needed. www.ckl2.org 192 I 4 £ 3 o 2 E 1 = z Percent Acceptance Rate 10 to 19 20 to 29 30 to 39 40 to 49 50 to 59 GO to 69 70 to 79 Percent Accepted b) Describe: The median and mean are difficult to identify from just a histogram. You will often only be able to estimate them. In this case, we were given all of the original data so we can find the exact values. When possible, identify outliers specifically. The median acceptance rate for these Universities is 30%. The percent of students applying, who are accepted to these universities ranged from 11% to 72%. However, the 72% was an extremely high outlier because the next highest rate was 1^9%). The majority of these schools accepted 36% or fewer of those who applied. The distribution is heavily skewed to the right because of the high outlier of American University. Section 5.4 Exercises 1) This graph shows the distribution of salaries (in thousands of dollars) for the employees of a large school district. Answer the questions that follow. GOO 530 | 400 2. 300 200 130 0-1 Q 11-21 22-32 33-43 44-54 55-65 66-76 77-87 83 + Salary ($ thousands) 193 www.ckl2.org Source: http://4.bp.blogspot.com a) Approximately how many employees make $77,000 or more per year? b) What is the bin width here? Be careful. c) Without calculating anything, how would you describe the typical salary of an employee of this school district? 2) Jessica is a freshman at the University of Minnesota, Duluth. She has been watching her weight because she is afraid of gaining that 'freshman fifteen' she keeps hearing about. She has weighed herself every Monday morning since school started. Here is a histogram showing the results in pounds of all of these Monday-Morning- Weigh-in's. £ <D 3 m 8 - 7 - 6 - 5 - 4 - 3 - 2 - 1 - - 132 134 136 138 140 142 144 146 148 Weight of Jessica a) Describe the distribution. Remember your S.O.C.C.S! b) What is the range for the bin that has 6 observations? c) For her height, Jessica feels that 140 lbs. is her ideal weight. What percent of the time has she been within 5 lbs. of her ideal weight? www.ckl2.org 194 3) Pretend you are a journalist. a) What do you notice that is wrong with this graph? b) Based on only what you can see in the graph and labels, write several sentences that could go with this graph. (Think S.O.C.C.S!) Ignore the mistake from part (a). Percentage of men spending at least one hour per week participating in sports or exercise, by age 70 60 . 50 3 in £ 40 o a 30 Q. 20 10 1 | 16-24 25-34 35-44 45-54 55-64 65-74 over 75 Age Source: Department of Health Men and exercise graph: http://www2.le.ac.uk 195 www.ckl2.org 4) Here are the statistics from several of the Minnesota Wild players. We are going to analyze the Penalties in Minutes (PIM) data. Forwards & Defensemen 24 POS PLAYER GP G A P +/- PIM PP SH GW s S% R -10 52 3 4 229 9.6 MAK I IN HAVLftl to £{. 4U at. 9 C MIKKO KOIVU 71 17 45 62 4 50 7 1 3 191 8.9 15 L , . -7 117 15.4 ANuKbW bKUNb I I b 82 la ^ : - ; 16 8 3 8 D BRENT BURNS 80 17 29 46 -10 98 8 3 170 10.0 7 C -14 34 5 4 ? 150 8.0 \d. £1 do 96 C PIERRE-MARC BOUCHARD 59 12 2ti 38 -3 14 2 98 12.2 21 ° 16 21 37 -4 56 2 1 1 126 12.7 KYLE BRODZIAK 80 20 R ANTTI MIETTINEN 73 16 19 35 -3 38 8 4 168 9.5 22 R -6 79 4 3 191 9.9 UALULUI I bKBUUK lx> iy id -ft 11 C JOHN MADDEN 76 12 13 25 -9 10 1 1 4 107 11.2 3 D 17 24 -6 30 3 53 13.2 MAREK ZIDLICKY 46 7 55 D NICK SCHULTZ 74 3 14 17 -4 38 46 6.5 12 R 9 7 16 -6 19 1 74 12.2 CHUCK KOBASEW 63 23 L ERIC NYSTROM 82 4 8 12 -16 30 1 83 4.8 46 D 4 8 12 -1 38 10.5 JAKbU arUKutUN bi l I u 1 4 ° CLAYTON STONER 57 2 7 9 5 96 1 40 5.0 16 R 4 5 9 -5 173 1 29 13.8 tsKAU b lAUHl I L /I 5 D GREG ZANON 82 7 7 -5 48 55 0.0 19 C PATRirK n ic ;iji i ivaw 91 1 6 7 -1 T n n n X! 3 7 48 L GUILLAUME LATENDRESSE 11 3 3 6 2 8 1 1 18 16.7 25 D CAM BARKER 52 1 4 5 -10 34 1 44 2.3 a) Construct a histogram for the penalties in minutes for the Wild players included on that list. b) Describe the distribution. Remember your S.O.C.C.S! www.ckl2.org 196 5) Refer to the Age at Inauguration of the United States Presidents data in this table. George Washington 57 Abraham Lincoln 52 Herbert Hoover 54 John Adams 61 Andrew Johnson 56 Franklin Roosevelt 51 Thomas Jefferson 57 Ulysses S. Grant 46 Harry S. Truman 60 James Madison 57 Rutherford B. Hayes 54 Dwight D. Eisenhower 62 James Monroe 58 James A. Garfield 49 John F Kennedy 43 John Quincy Adams 57 Chester A. Arthur 50 Lyndon B. Johnson 55 Andrew Jackson 61 Grover Cleveland 47 Richard M. Nixon 56 Martin Van Buren 54 Benjamin Harrison 55 Gerald Ford 61 William Henry Harrison 68 Grover Cleveland 55 Jimmy Carter 52 John Tyler 51 William McKinley 54 Ronald Reagan 69 James K, Polk 49 Theodore Roosevelt 42 George H. W. Bush 64 Zachary Taylor 64 William Howard Taft 51 Bill Clinton 46 Millard Fillmore 50 Woodrow Wilson 56 George W. Bush 54 Franklin Pierce 48 Warren G, Harding 55 Barack Obama 47 James Buchanan 65 Calvin Coolidge 51 Source: http://qrc.depaul.edu a) Construct a histogram for the distribution of ages of the U.S. Presidents at their inauguration. b) Calculate the mean, median, mode. c) Which of these three measures of central tendency best describe the typical inauguration age? Explain. 6) Sketch a histogram that fits the following scenarios: a) Symmetrical with a few high outliers and a few low outliers. b) Strongly skewed right with no outliers. c) Bimodal and symmetrical. d) Skewed left with a few outliers. e) Doesn't fit any of the descriptions we have learned. 197 www.ckl2.org 5.5 Numerical Data: Box Plots &; Outliers Learning Objectives • Calculate the five number summary for a set of numerical data • Construct box plots • Calculate IQR and standard deviation for a set of numerical data • Determine which numerical summary is more appropriate for a given distribution • Determine whether or not any values are outliers based on the 1.5* (IQR) criterion • Describe distributions in context- including shape, outliers, center, and spread Box Plots A box plot (also called box-and-whisker plot) is another type of graph used to display data. A box plot divides a set of numerical data into quarters. It shows how the data are dispersed around a median, but does not show specific values in the data. It does not show a distribution in as much detail as does a stem plot or a histogram, but it clearly shows where the data is located. This type of graph is often used when the number of data values is large or when two or more data sets are being compared. The center and spread of the distribution are very obvious from the graph. It is easy to see the range of the values as well as how these values are distributed around the middle value. The smaller the box, the more consistent the data values are with the median of the data. The shape of the box plot will give you a general idea of the shape of the distribution, but a histogram or stem plot will I do this more accurately. Any outliers will show up as long whiskers. The box in the box plot contains the middle 50% of the data, and each 'whisker' contains 25% of the data. The Five Number Summary In order to divide into fourths, it is necessary to find five numbers. This list of five values is called the five number summary. The numbers in the list are {minimum value, Quartile 1, Median, Quartile 3, maximum value}. We have already learned how to find the median of a set of numbers (put in order and find the middle value), and the minimum and maximum are the smallest and largest numbers. Now we will learn how to find the quartiles. 5# sum = {mm. Qi, Med, Q3, max} www.cki2.0rg 198 Quartiles The first step is to list all of the numbers in order from least to greatest. The minimum and maximum are now on the ends of the list and we can count in to find the median-circle these three values. Finding the quartiles is just like finding the median. Quartile 1 is the 'median' of all of the values to the left of the median (do NOT include the median itself). Quartile 3 is the 'median' of all of the values to the right of the median (do not include the median). Constructing a Box Plot Now list the five number summary in order {min, Ql, Med, Q3, max). The next step is to mark an axis that covers the entire range of the data. Mark the numbers along the axis before you make the box plot, so that the resulting plot shows the shape of the data. The last step is to place a dot above the axis for the 5 numbers from the five number summary, and then to make a 'box' through the second and fourth dots, mark a line through the middle dot to show the median, and mark 'whiskers' from the box out to the first and fifth dots. #*** Example 1 You have a summer job working at Paddy's Pond which is a recreational fishing spot where children can go to catch salmon which have been raised in a nearby fish hatchery and then transferred into the pond. The cost of fishing depends upon the length of the fish caught ($0.75 per inch). Your job is to transfer 15 fish into the pond three times a day. But, before the fish are transferred, you must measure the length of each one and record the results. Below are the lengths (in inches) of the first 15 fish you transferred to the pond. Calculate the five number summary, and construct a box plot for the lengths of these fish. Length of Fish (in.) 13 14 6 9 10 21 17 15 15 7 10 13 13 8 11 Solution Since box plots are based on the median and quartiles, the first step is to organize the data in order from smallest to largest. 6 7 8 9 10 10 11 13 13 13 14 15 15 17 21 199 www.ckl2.org 6, 7, 8, 9, 10, 10, 11, 13 , 13, 13, 14, 15, 15, 17, 21 The minimum is the smallest number (min = 6), and the maximum is the largest number (max = 21). Next, we need to find the median. This has an odd number of data, so the median of all the data is the value in the middle position (Med = 13). There are 7 numbers before and 7 numbers after 13. The next step is the find the median of the first half of the data - the 7 numbers before the median, but not including the median. This is called the lower quartile since it marks the point above the first quarter of the data. On the graphing calculator this value is referred to as Q\. 6,7,8, 9 ,10,10,11 Quartile 1 is the median of the lower half of the data (Qi = 9). This step must be repeated for the upper half of the data - the 7 numbers above the median of 13. This is called the upper quartile since it is the point that marks the third quarter of the data. On the graphing calculator this value is referred to as Q3. 13,13,14, 15 ,15,17,21 Quartile 3 is the median of the upper half of the data (Q3 = 15). Now that the five numbers have all been determined, it is time to construct the actual graph. The graph is drawn above a number line that includes all the values in the data set (graph paper works very well since the numbers can be placed evenly using the lines of the graph paper). For this examle we will need to mark from at least 6 to at least 21. Be sure to mark your axis before you start to construct the box plot. Next, represent the following values by placing dots above their corresponding values on the number line: Minimum - 6 Quartile 1-9 Median - 13 Quartile 3-15 Maximum - 21 The five data values listed above are often called the five number summary for the data set and are necessary to graph every box plot. Make the 'box' part around the Qi and Q3 values, make 'whiskers' out to the min and max values, and make a vertical line to show the location of the median. This will complete the box plot. Length of fish (in inches) 5# summary = {6, 9, 13, 15, 21} 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 3C www.ckl2.org 200 The five numbers divide the data into four equal parts. In other words: • One-quarter of the data values are located between 6 and 9 • One-quarter of the data values are located between 9 and 13 • One-quarter of the data values are located between 13 and 15 • One-quarter of the data values are located between 15 and 21 More Measures of Spread Range We have already learned how to find the range of a set of data. The range represents the entire spread of all of the data. The formula for calculating the range is: max - min = range Inner Quartile Range The quartiles give us one more measure of spread called the inner quartile range. The inner quartile range (IQR) is the range between the lower and upper quartile. To find the IQR, subtract the quartile 1 value from the quartile 3 value (Q3 - Qi = IQR). The IQR represents the spread, or range, of the middle 50% of the data. The IQR is a measure of spread that is used when the median is the measure of central tendency. The formula for calculating the IQR is: Qs - Qi = IQR Standard Deviation Another measure of spread that is used in statistics is called the standard deviation. The standard deviation measures the spread around the mean. This value is more difficult to calculate than range or IQR, but the formula used takes all of the data values in the distribution into account. Standard deviation is the appropriate measure of spread when the mean is the measure of center. However, the standard deviation is easily affected by outliers or skewness because every value is calculated in the formula. The symbol for standard deviation of a sample is s (on the graphing calculators it is S x -\ and for a population it is cr (sigma). The standard deviation can be any number zero or greater. It will only be equal to zero if there is no spread (i.e. all values are exactly the same). The more spread out the data is, the larger the standard deviation will be. The standard deviation is most appropriate when you have a very symmetrical, bell-shaped distribution called a normal distribution. We will study this type of distribution in chapter 7. 201 www.ckl2.org Which Numerical Summary Should We Use? We have learned several statistics that are measures of central tendency and several that are measures of spread. How do we know which one(s) to use? The mean and standard deviation go together. And, the median will go with the IQR (or range). The most important thing to remember is that the mean and the standard deviation are both affected by outliers and by skewness in a distribution. So if either of these is present, then the mean and standard deviation are not appropriate. However, it is always an option, and often interesting to calculate all of the statistics and compare them to one another. The general guidelines are: Skewed Distribution or Oultiers Median as center IQR & range as spread Symmetrical No outliers Bell Shaped Mean as center Standard deviation as spread How to Calculate the Standard Deviation In order to calculate the standard deviation you must have all of the values. Then you follow these steps: 1. Calculate the mean of the values. 2. Subtract the mean from each data value. These are the individual deviations. 3. Each of these deviations is squared. 4. All of the squared deviations are added up. 5. This total of the squared deviations is divided by one less than the number of deviations. This is the variance. 6. Take the square root of the variance. This is the standard deviation. The formula for calculating the variance is: ^ = ^tZ- = i(^-^) 2 The formula for calculating standard deviation is: V^Z7=i(*-- As you can probably tell, this formula is very time consuming when you have a large set of data. Also, it is easy to make a mistake in your calculations. We will show the process with a small set of data, but generally we will use our calculator to find the standard deviation. www.ckl2.org 202 Example 2 Find the mean and standard deviation for these numbers:{12, 15, 14, 17, 19} Solution 1. Calculate the mean of the values. = = 15.4 5 2. Subtract the mean from each data value. These are the individual deviations. 3. Each of these deviations is squared. 4. All of the squared deviations are added up. 5. This total of the squared deviations is divided by one less than the number of deviations. This is the variance. 6. Take the square root of the variance. This is the standard deviation. Data values Value - mean = deviation Deviation squared X ( x — x) ( x — x) 2 12 (12 -15.4) = -3.4 (-3.4) 2 = 11.56 15 (15 -15.4) = -0.4 (-0.4) 2 = 0.16 14 (14 -15.4) = -1.4 (-1.4) 2 = 1.96 17 (17 -15.4) = 1.6 (1.6) 2 = 2.56 19 (19 -15.4) = 3.6 (3.6) 2 = 12.96 Sum of the squared deviations 29.2 . , . sir Variance = m 1 7 29.2 n- ^ 5-1 ^ Standard Deviation = vs 2 sj= V7.3 = 2.7019 The mean of these numbers is 15.4 o,nd the standard deviation is 2.7019. The standard deviation is tedious to calculate. For any problem where you are asked to calculate the standard deviation, you may use your calculator or a computer to find it. 203 www.ckl2.org Example 3 After one month of growing, the heights of 30 parsley seed plants were measured and recorded. The measurements (in inches) are shown in the table below. Table 5.8: Heights of Parsley (in.) 22 28 30 40 38 18 11 37 12 34 49 17 25 37 46 39 8 27 16 38 18 23 26 14 6 26 23 33 11 26 a) Calculate the five number summary and construct a box plot to represent the data. b) Describe the distribution. c) Calculate the mean and standard deviation. d) Calculate the median, and IQR Solution a) five number summary and box plot: order the values- The data organized from smallest to largest is shown in the table below. (You could use your calculator to quickly sort these values) Table 5.9: Heights of Parsley (in.) 6 8 11 11 12 14 16 17 18 18 22 23 23 25 26 26 26 27 28 30 33 34 37 37 38 38 39 40 46 49 www.ckl2.org 204 5# summary- This time there is an even number of data values so the median will be the mean of the two middle values. Med = — i — = 26 (We will not use the median, but we do use the values on either side of it when finding quartiles). The median of the lower half is the number in the 8th position which is 17. The median of the upper half is the number in the 22nd position (or 8th from the top) which is 37. The smallest number is 6 and the largest number is 49. 5# summary = {6, 17, 26, 37, 49} Height of Parsley Seedlings 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 b) describe-don't forget your S.O.C.C.S! The heights of these parsley plants ranged from 6 inches to 49 inches after one month. The distribution is very symmetrical and does not contain any outliers. The median height for these parsley plants was 26 inches tall. The middle 50% of the plants were all between 11 inches and 31 inches tall. c) The mean and standard deviation were calculated using the TI-84+. x = 25.9333 s = 11.4709 d) The median is part of the five number summary. The IQR = Q3 - Qi = 37 - 17 = 20 Med = 26 IQR = 20 205 www.cki2.0rg Outliers ***** We have been noticing some values that appear to be outliers, but have not defined a specific distance to be considered an outlier. The common outlier test, used to determine whether or not any of the values are outliers uses the IQR. This outlier test, often called the 1.5*(IQR) Criterion, says that any value that is more than one and one-half times the width of the IQR box away from the box is an outlier. loixx a„s xx i« Any value past the cutoff points (shown as dashed lines above) will be considered an outlier. The above example would have at least one low outlier, but no high outliers. Cutoff value for LOW OUTLIERS: Ql - 1.5* (IQR) *any value less than this number is considered a low outlier Cutoff value for HIGH OUTLIERS: Q3 + 1.5* (IQR) *any value greater than this number is considered a high outlier www.ckl2.org 206 Example 4 Test the sodium in the McDonald's® sandwiches listed in the table in exercise number one for outliers. Use the 1.5*(IQR) Criterion. Show your steps. Solution Calculate the five number summary for the Amount of Sodium (in mg) fivenumber summary = {520,835,1095,1285,2070} First find the IQR: IQR = 1285 - 835 = 450 Test for low outliers: Q\ - 1.5(1 QR) 835 - 1.5(450) = 160 Test for high outliers: Q3 + 1.5(IQR) 1285 + 1.5(450) = 1960 Check the data to see if we have any outliers: We have no sandwiches with less than 160 mg sodium, so we have no low outliers. We have one value that is greater than this cutoff of 1960 mg. The Angus Bacon & Cheese burger has 2070 mg of sodium, so we have one high outlier. 207 www.ckl2.org Section 5.5 Exercises 1) Here is some nutritional information about a few of the sandwiches on the McDonald's® menu. Nutrition Facts 1 w ? I I «! o (J - E *^ v> 1 o 1 u u. — I s — ■ ■ iff S i '5 as | i "a 1 o .c 're a — E E 1 in *9 s >• 'S a Sandwiches Hamburger 3.5 oz (100 g} 250 80 9 13 3.5 16 0.5 25 9 520 22 Cheeseburger 4 oz (114g> 300 110 12 19 6 28 0.5 40 13 750 31 Double Cheeseburger 5.8 oz (tesg) 440 210 23 35 11 54 1.5 80 26 1150 48 VIcDouble 5.3 oz (151 g) 390 170 19 29 3 42 1 65 22 920 38 Quarter founder® with Cheese-i- 7 oz (198g> 510 230 26 40 12 61 1.5 90 31 1190 50 Double Quarter 3 ounder®with Cheese++ 9.8 oz (279 g) 740 380 42 65 19 95 2.5 155 52 1380 57 Big Mac® 7.5 oz (214 g) 540 260 29 45 10 50 1.5 75 25 1040 43 Big N' Tasty® 7.2 oz (206 g) 460 220 24 37 S 42 1.5 70 23 720 30 Jig N' Tasty® with Cheese 7.7 oz (220 g> 510 250 28 43 11 54 1.5 85 28 960 40 Angus Bacon & Cheese 1 0.2 oz (291 g) 790 350 39 60 17 87 2 145 49 2070 86 Angus Deluxe 11.1 oz (314 g> 750 350 39 60 16 82 2 135 45 1700 71 Angus Mushroom & Swiss 10 oz (283 g) 770 360 4D 61 17 85 2 135 46 1170 49 Source: http://nutrition.mcdonalds.com. July 27, 2011. Determine the median and the IQR for the following data regarding the McDonald's® sandwiches: a) Calories from fat b) Cholesterol www.ckl2.org 208 2) Analyze the calories for these McDonald's® sandwiches. a) Calculate the five number summary and construct an accurate box plot for the calories for these sandwiches. b) Use the outlier test to determine whether there are any outliers for calories. Test for both high and low outliers. Show your steps. c) Describe the distribution in context- Remember your S.O.C.C.S! 3) Analyze the sodium content further. a) Construct a box plot for sodium. b) Calculate the median and IQR for sodium (see example 4). c) Calculate the mean and standard deviation for sodium (use a calculator). Now remove the high outlier from the data. d) Re-calculate the median and IQR for sodium with the Angus Bacon & Cheese data removed. Did either value change from part (b)? e) Re-calculate the mean and standard deviation for sodium with the Angus Bacon & Cheese data removed. Did either value change from part (c)? 209 www.ckl2.org 4) The following table shows the potential energy that could be saved by manufacturing each type of material using the maximum percentage of recycled materials, as opposed to using all new materials. Table 5.10: Manufactured Material Energy Saved (millions of BTU's per ton) Aluminum Cans 206 Copper Wire 83 Steel Cans 20 LDPE Plastics (e.g. trash bags) 56 PET Plastics (e.g. beverage bottles) 53 HDPE Plastics (e.g. household cleaner bottles) 51 Personal Computers 43 Carpet 106 Glass 2 Corrugated Cardboard 15 Newspaper 16 Phone Books 11 Magazines 11 Office Paper 10 Source: National Geographic, January 2008. Volume 213 No., pg 82- a) Calculate the five number summary and construct an accurate box plot for the Energy Saved data. b) Use the outlier test to determine whether there are any outliers. Show your steps. c) Calculate the mean and standard deviation for the Energy Saved data. How do the mean and the median compare? d) Delete any outliers. Recalculate the five number summary, mean and standard deviation. Which values changed? www.ckl2.org 210 5) The Burj Dubai will become the world's tallest building when it is completed. It will be twice the height of the Empire State Building in New York. The chart lists the 15 tallest buildings in the world (as of 12/2007). Table 5.11: Building City Height (ft) Taipei 101 Shanghai World Financial Center Petronas Tower Sears Tower Jin Mao Tower Two International Finance Cen- ter CITIC Plaza Shun Hing Square Empire State Building Central Plaza Bank of China Tower Bank of America Tower Emirates Office Tower Tuntex Sky Tower Burj Dubai Tapei Shanghai Kuala Lumpur Chicago Shanghai Hong Kong Guangzhou Shenzen New York Hong Kong Hong Kong New York Dubai Kaohsiung 1671 1614 1483 1451 1380 1362 1283 1260 1250 1227 1205 1200 1163 1140 a) Calculate the five number summary for these 15 buildings and construct an accurate box plot. b) Use the outlier test to determine whether there are any outliers among these 15 buildings. Test for both high and low outliers. Show your steps. c) Describe the shape of the distribution. Remember your S.O.C.C.S! d) Within what range of heights are the middle 50% of these buildings? 211 www.ckl2.org 6) The table shows the U. S. Presidents' ages at inauguration. George Washington 57 Abraham Lincoln 52 Herbert Hoover 54 John Adams 61 Andrew Johnson 56 Franklin Roosevelt 51 Thomas Jefferson 57 Ulysses S. Grant 46 Harry S, Truman 60 James Madison 57 Rutherford B. Hayes 54 Dwight D. Eisenhower 62 James Monroe 58 James A. Garfield 49 John F Kennedy 43 John Quincy Adams 57 Chester A. Arthur 50 Lyndon B. Johnson 55 Andrew Jackson 61 Grover Cleveland 47 Richard M. Nixon 56 Martin Van Buren 54 Benjamin Harrison 55 Gerald Ford 61 William Henry Harrison 68 Grover Cleveland 55 Jimmy Carter 52 John Tyler 51 William McKinley 54 Ronald Reagan 69 James K. Polk 49 Theodore Roosevelt 42 George H. W. Bush 64 Zachary Taylor 64 William Howard Taft 51 Bill Clinton 46 Millard Fillmore 50 Woodrow Wilson 56 George W. Bush 54 Franklin Pierce 48 Warren G Harding 55 Barack Obama 47 James Buchanan 65 Calvin Coolidge 51 a) Construct a box plot for the Presidents ages at inauguration. b) Make a statement, in context, about what the 'box' part of the box plot tells you. c) Describe the distribution. Remember your S.O.C.C.S! Identify any unusual values specifi- cally. www.ckl2.org 212 7) Several game critics rated the Wow So Fit game, on a scale of 1 to 100 (100 being the highest). The results are presented in this stem plot: S 2 3 5 7 4 5 Key: 6 | 7 = 67 8 9 2 2 4 6 7 7 9 002567799 1122356679 2 2 2 6 7 a) Calculate the five number summary for the Wow So Fit data. b) Construct a box plot for the data. c) Describe this distribution. d) Calculate the mean and standard deviation. 213 www.ckl2.org 5.6 Numerical Data: Comparing Data Sets Learning Objectives • Construct parallel box plots • Construct back-to-back stem plots • Compare more than one set of numerical data in context Parallel Box Plots Parallel box plots (also called side-by-side box plots) are very useful when two or more numerical data sets need to be compared. The graphs of the parallel box plots are plotted, one parallel to the other, along the same number line. This can be done vertically or horizontally and for as many data sets as needed. Example 1 The figure shows the distributions of the temperatures for three different cities. By graphing the three box plots along the same axis, it becomes very easy to compare the temperatures of the three cities. What are some conclusions that can be drawn about the temperatures in these three cities? Temperature Range by City 9225 £ 722S Z p 62:25 ■ i225 '::: c rv 1 C(V2 Qtsi CIV3 http://www.mathworksheetscenter.com www.ckl2.org 214 Solution Here are some conclusions, based on the graphs, that might be made. Think S.O.C.C.S! And, be sure to compare the distributions to one another, using statistics to support your observations. Quartile 1 for City 2 is higher than the quartile 3 in City 1 and the median in City 3. Also, the minimum temperature in City 2 is at about the median for the other two cities. City 2 is generally warmer than both of the other cities. Cities 1 and 3 have nearly the same median temperature, around 60° to 63° . Whereas, the median temperature in City 2 is around 82° . City 3 has a much larger range in temperatures (35° to 85°), than City 1 (45° to 75°) or City 2 (62° to 95°). Thus, the temperature in City 1 is the most consistent of the three. The temperature distributions in all three cities are fairly symmetrical and none have any outliers. Comparing Numerical Data Sets When you are given numerical sets of data for more than one variable and asked to compare them, it will be necessary to construct graphical representations for each data set. In order to compare them to one another the scales must match. When comparing more than one box plot, we construct parallel box plots. When using histograms, we can match the horizontal and vertical scales so that the separate histograms can 'line up'. Dot plots will work the same way as histograms. Such comparisons are also possible when working with stem plots. Two sets of numerical data can simply share the stems in the middle, with one set's 'leaves' going to the right and the other set's 'leaves' going to the left. On both sides of the plot, the 'leaves' will go in numerical order out. Plots like these are called back-to-back stem plots. Once you have constructed any of these types of comparative graphical representations (on the same scale,) you can make observations about how the data sets are the same and how they are different. Just as we have been doing up to this point, those comparisons should be done in context. The observations made might address the shapes of the distributions and whether or not any outliers are present. It is important to compare the centers of the distributions (means, medians, or modes). And, the spreads of the distributions should also be addressed (ranges, IQRs, or standard deviations). Example 2 A teacher gave the same physics exam to her two sections of physics. She has been wondering whether the first period and fifth period classes are learning the same amount as one another. She constructed this back-to-back stem plot to compare the test scores for the two different classes. a) Calculate the five number summary for both classes. b) Calculate the mean and standard deviation for both classes. c) Compare the two classes' test scores in context. 215 www.ckl2.org Class A Class B Leaves Stems Leaves 8 6 5 7 13 3 5 6 7 6 4 8 4 5 6 6 4 4 2 1 9 10 1 2 http://www.basic-mathematics.com Solution a) The numbers in the stem plots are already in order, so these statistics could be found by hand or with a graphing calculator. Five number summary for Class A {60, 75, 90.5, 94, 100} Five number summary for Class B {60,71,75.5,85,92} b) These statistics are most efficiently found using a graphing calculator. Class A mean x « 85.7143 Class A standard deviation s « 12.6396 Class B mean x * 76.6429 Class B standard deviations s « 10.0507 C) Comparison Overall, Class A did better on this test than Class B did. Class A 's scores on this test are skewed to the left, but Class B's scores are skewed to the right. Neither class has any outliers among the test scores. Class A has a mean score of about 9 points higher (85.7 compared to 76.6) and a median score of 15 points higher (90.5 compared to 75.5). The overall range for the two classes is fairly similar, but the Class A students' scores were less consistent. The ranges (32 and 40), IQRs (14 and 19), and standard deviations (10.1 and 12.6), all show that Class B's scores are less spread out than Class A 's scores. www.ckl2.org 216 Example 3 An oil company claims that its premium grade gasoline contains an additive that significantly increases gas mileage. They conducted the following experiment in an effort to prove their claim. They selected 15 drivers who all drove the same make, model and year of car. Starting with an empty gas tank, each car was filled with 45L of one of the two types of gasoline (selected in a random order). The driver was asked to drive until the fuel light warning came on. The number of kilometers was recorded and then the car was filled with the other type of gasoline (whichever they had not used yet). The process was repeated and the number of kilometers was again recorded. The results below show the number of kilometers each car traveled. Regular Gasoline Premium Gasoline 640 570 640 580 610 659 619 639 629 664 540 555 588 615 570 635 709 637 633 618 550 590 585 587 591 694 638 689 589 500 Display each set of data to explain whether or not the claim made by the oil company is true or false. Solution order the data— list the values in order for each set of data Regular Gasoline Premium Gasoline 540 550 555 570 570 500 589 618 619 629 580 585 587 588 590 633 635 637 638 639 591 610 615 640 660 659 664 689 694 709 5 # summaries- Determine the five number summary for each set of data separately. Be sure to report your five number summary, whether asked to or not. Five-Number Summary Regular Gasoline Premium Gasoline Smallest # 540 500 Q, 570 619 Median 587 637 Q, 610 664 Largest # 660 709 217 www.ckl2.org box plots -Mark your number axis so that it covers the entire range needed - smallest minimum to largest maximum (we need 500 to 709 for these two data sets). Then graph each box plot along the same axis, but parallel to each other. This allows for the two data sets to be easily compared to one another. Key: blue = regular gasoline gold = premium gasoline Regular Gasoline vs. Premium Gasoline 490 510 530 550 570 590 610 630 650 670 690 710 conclusions— make comparisons by looking for any similarities and differences between the two distributions. Remember your S.O.C.C.S! Based on this experiment, the number of kilometers that the cars were able to travel on the premium gasoline was greater than the number of kilometers that the same cars were able to travel with the regular gasoline. The median number of kilometers for premium gasoline was 637, compared to 587 for regular gas. The first quartile for premium was higher than the third quartile for regular. Also, 25% of those with the premium gasoline went further than all of those using regular gasoline. The distribution for the regular fuel is slightly skewed to the right, but doesn't have any outliers. However the premium distribution is strongly skewed to the left toward one outlier on the low end (500 km). Based on these results, it appears that the additive in the premium gasoline does improve gas mileage for this make and model of car. Further tests should be done on other types of vehicles. www.ckl2.org 218 Example 4 The heights of a group of students are all included in the first histogram. The second histogram only contains the data from the male students and the third is a graph of the heights of only the girls. Explain what this histograms show. Heights of CSE 200 students 10 60 62 64 66 68 70 72 74 76 78 80 Height (in) Heights of male CSE 200 students 60 62 64 66 68 70 72 74 76 78 80 Height (in) Heights of female CSE 200 students 60 62 64 66 68 70 72 74 76 78 80 Height (in) Solution The range of heights of all students in this group is approximately 20 inches. However, the female heights only range about 11 inches and the male heights only range about 13 inches. The females' height distribution is the most symmetrical of all three. There is one male whose height is a high outlier, but none for the females. The median height for the class is around 10 inches, for males it is slightly higher around 12 inches, and for females it is around 65 inches tall. In general, the female students tend to be shorter than the male students. 219 www.ckl2.org Section 5.6 Exercises 1) Compare the %Daily Value for Total Fat(g) to the %Daily Value for Saturated Fat(g ) for these McDon- ald's® sandwiches. a) Calculate the five number summary for both %Daily Values. b) Construct parallel box plots for both. d) Make at least four observations to compare these two distributions. Nutrition Facts 8 8 £ 1 tfi V) Si l. o a ll. § 1 1 o ■ 5 U. 2 o h- 1 AS > -2 1 E 07 ■ S IS a as I i 2 a > s D B E E 1 *s > a D Sandwiches Hamburger 3.5 oz (100 g) 250 BO 9 13 3.5 16 0.5 25 9 520 22 Cheeseburger 4 oz (114 g) 300 110 12 19 6 28 0.5 40 13 750 31 Double Cheeseburger 5.8 oz (165 g) 440 210 23 35 11 54 1.5 80 26 1150 48 Vic Double 5.3 oz (151 g) 390 170 19 29 8 42 1 65 22 920 38 Quarter 3 ou nder® with Cheese* 7 oz (198 g} 510 230 26 40 12 61 1.6 90 31 1190 50 Double Quarter 3 ou rider® with 3heese++ 9.8 oz (279 g) 740 380 42 65 19 95 2.5 155 52 1380 57 Uig Mao® 7.5 oz (214 g) 540 260 29 45 10 50 1.6 75 26 1040 43 Big N' Tasty® 7.2 oz (206 g) 460 220 24 37 8 42 1.6 70 23 720 30 3ig N' Tasty® with Cheese 7.7 OZ (220 g) 510 250 28 43 11 54 1.5 85 28 960 40 ^ngus Bacon & Cheese 10.2oz (291 g) 790 350 39 60 17 87 2 145 49 2070 86 4ngus Deluxe 11.1 oz (314 g) 750 350 39 60 16 82 2 135 45 1700 71 ^ngus Mushroom & Swiss 10 oz (283 g) 770 360 40 61 17 85 2 135 46 1170 49 Source: http://nutrition.mcdonalds.com. July 27, 2011. www.ckl2.org 220 2) The heights of the students in a statistics class were all measured to the nearest inch. The results are presented in this back-to-back stem plot. Notice that it is also a split stem plot. The girls' heights are ordered out to the right on the right side. And the boys' heights are ordered out to the left on the left side. The Heights of the Students in Our Class (in inches) Boys' Heights Girls' Heights 5 8 9 4 3 6 112 3 4 4 9 8 7 7 6 5 6 5 6 6 7 9 4 3 30UO 7 >s \\ That 2 represents the height of 72 inches for one of the boys in this class. a) Compute the standard deviation, the range, and the IQR for both girls and boys. b) Compare the spread for the two groups, based on your answers to (a), in context. c) Compute the mean, median, and mode for both boys and girls. d) Compare the center for the two groups, based on your answers to (c), in context. e) Compare the shape of the distributions, based on the graph, in context. 221 www.ckl2.org 3) Compare the results of the Probability and Statistics District Common Assessment for two statistics classes. CLASS 3: 45, 37, 14, 42, 24, 33, 41, 16, 39,24, 38, 35, 35, 32, 51, 46, 30, 42, 25, 37,37, 19, 26, 23, 28, 38, 16, 35, 21 CLASS 4: 35,37,25, 44, 31, 27, 26,35, 24,41, 30, 29,30,29,40, 25, 38, 31,42, 46,37, 32, 20,40,35, 29, 25, 31, 27,43, 27,30, 38, 36,37 a) Construct back-to-back stem plots (use split-stems) for these two classes. b) Calculate the five number summaries for both classes. c) Calculate the following statistics for both classes: mean, standard deviation, mode, range, and IQR. d) Compare and contrast the two distributions. This should be in context and you should make at least four distinct observations. 4) The number of home-runs during a season is one of the statistics recorded about baseball players. The following table has the number of home-runs (over many seasons) for several of the best hitters in baseball. Compare the home-run hitting performance of these exceptional baseball players. > Babe Ruth: 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22 > Mark McGwire: 49, 32, 33, 39, 22, 42, 9, 9, 39, 52, 58, 70, 65, 32, 29 > Barry Bonds: 16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73, 46 > Roger Maris: 13, 23, 26, 16, 33, 61, 28, 39, 14, 8 a) Construct Parallel Box Plots for the four players. Be sure to use the same scale for all four graphs and to label each graph. b) Calculate the following for all four players: x IQR 5 # summary = { , -5 5 c) Test for outliers, for all four players, using the 1.5*IQR criterion- (show work). d) Compare and contrast the four distributions. This should be in context and you should make at least four distinct observations. www.ckl2.org 222 5) The following box plots show the average miles per gallon (city) for various types of vehicles. Comment on what these parallel box plots show. This should be in context and include at least 4 distinct observations. o co o 00 o in o o 7 I I I I I I T [ I i J ^^^^^^^^^^™ I I o Compact Large Midsize Small Sporty Van boxplot(MPG. city ~ Type) # base package http://www.fort.usgs.gov 223 www.ckl2.org 6) Refer to the four dot plots to answer the questions that follow. Graph I o o I § i i ■ s s o 8 o o o o o o o o 20 40 60 80 Graph II § o 8 o o OOP § § o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o OOP 20 40 Graph III 60 80 o o o o o o o o o o O OOP o O O O 20 40 60 Graph IV O o o o o o o o o o O o o o o o o o POO o o o o o o o o o o o o o o 20 40 60 o o 8 o o o o o 80 o o o o o o o o o o o o o o 80 a) Identify the overall shape of each distribution. b) How would you characterize the center(s) of these distributions? c) Name at least two statistics that would most likely be the same for all four of these distri- butions. d) Which of these distributions has the smallest standard deviation? Which of these distribu- tions has the largest standard deviation? Explain. e) For which of these distributions would it be appropriate to use the mean and standard devia- tion as numerical summaries? For which would the five number summary be more appropriate? www.ckl2.org 224 5.7 Chapter 5 Review Chapter 5 Summary In this chapter, we have learned that when working with a set of data it is important to choose an appropriate type of graphical display so that we can see what the data looks like. Bar graphs and pie charts are useful ways to display categorical data. Time plots are line graphs that help us to see how a given variable has changed over a specified period of time. And, when working with numerical data, we have learned how to make dot plots, stem plots, histograms, and box plots. It is also possible to make graphs so that comparisons can be made between more than one data set. Back-to-back stem plots and parallel box plots are two such types of graphs. The next step is to analyze the data set(s) by calculating numerical statistics. The statistics that give us an idea of where the center of the data is are the mean, median, and mode. These statistics are measures of central tendency, and give us an idea of where the 'average' of the data lies. The range, inner quartile range (IQR), and the standard deviation are all measures of the spread of a set of data. We have also learned how to calculate the five number summary, which divides a set of data into quarters and allows us to construct a box plot. Once the graphs are constructed and the statistics are calculated, we have learned to describe what these show. When describing a numerical set of data, in addition to explaining where the center and spread are, we also describe the shape of the distribution and whether or not any outliers are present in the data. The shapes that we focused on are symmetrical distributions and skewed distributions, remembering that the direction of the skew is toward the tail or outliers. We learned to make appropriate conclusions and comparisons that are based on the data, graphs, and statistics. Statisticians should avoid opinions and judgment statements as much as possible. We learned that the 1.5*(IQR) Criterion can be used to determine whether or not any data values are outliers. And, that the mean and standard deviation are easily changed, so these statistics are not the appropriate measures of center and spread when working with data that contains outliers or is skewed. Chapter 5 Review Exercises 1) Multiple-Choice: Which of the following can be inferred from this histogram? 35 ?,(} 25- 20 15- 10 5 < [ ' 2 4 8 10 12 14 225 www.ckl2.org a) The mode is 1. b) mean < median c) median < mean d) The distribution is skewed left. e) None of the above can be inferred from this histogram. 2) The owner of a small company is trying to determine whether he should go with a different company for his shipping needs. He needs to analyze the weights of the packages that his company ships out. This graph shows the distribution of the weight of packages that were shipped during the last month. Freqi jency Plot Check Sheet Package Weight o o o o o o o o o o o o o o o o o o o o o o o o 16.0 16.1 16.2 16.3 16.4 16.5 16.6 16.7 16.8 16.9 17.0 Weight in Ounces a) Calculate the mean, standard deviation, mode, and range for this data. Use a calculator for mean and standard deviation. b) Determine the five number summary for this data. Construct a box plot for this data. c) Which of these two graphs is more informative? Explain. d) He figures that he will save money with the new company on any packages that weigh less than 16.75 ounces. What percent of packages weighed more than 16.75 ounces? www.ckl2.org 226 3) After some bullying issues were brought to light in a big high school, a committee was formed to study the issue. A questionnaire was designed that contained several questions related to bullying and safety. A stratified random sample was selected that included students from all four grade levels. The table that follows shows the responses to one of the questions on the questionnaire. Student Responses to the Quest "t Fee! Safe at School" on: n = sample size Strongly Disagree Disagree Agree Strongly Agree 1003 133 274 529 67 a) Use Excel or Google Docs to create a pie chart that shows the results of this survey question. Be sure to include labels, percents, a title, and a key if needed. b) Describe what the graph shows in context. Be sure to include percents to support your observations. c) Comment on whether or not the committee should be concerned. Explain. In questions 4-7, match the distribution with the choice of the correct real-world situation that best fits the graph. 4) 8 5 a o o o o o o o o o ° 9 o 2 o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o O O Q Q O o o o 5 o o o o o o o o o o o o o o o o o o o o o o o o 227 www.ckl2.org 5) o o o o o ooo o o O O O OOO 00 o o o oooo o ooo oooooooooooooooooo 7) a) Andy collected and graphed the heights of all the 12 grade students in his high school. b) Brittany asked each of the students in her statistics class to bring in 20 pennies selected at random from their pocket or piggy bank. She created a plot of the dates of the pennies. c) Maya asked her friends what their favorite movie was this year and graphed the results. d) Jeno bought a large box of doughnut holes at the local pastry shop, weighed each of them, and then plotted their weights to the nearest tenth of a gram. www.ckl2.org 228 Questions 8-17 are multiple-choice questions. Select the best answer from the choices given. 8) Which of the following box plots matches the histogram? 12- 10 4- 2- i 1 1 1 1 1 1 1 1 [ — i 1 r 2 4 —i— A — i— B -i — C 229 www.ckl2.org 9) Identify the 5 number summary for this set of numbers: 12,356; 16,564; 15,684; 12,358; 15,987; 13,556; 18,564; 18,965; 19,683; 18,432; 18,563; 19,352 a) {12,356; 14,600; 17,498; 18,000; 19,683} b) {12,356; 14,620; 17,498; 18,764.5; 19683} c) {12,356; 14,650.5; 17,498; 18,700.5; 19683} d) {12,356; 14,683; 17,500; 18,800; 19683} e) {12,356; 14,695.5; 17,900; 18,888; 19683} 10) Thirty students took a statistics examination having a maximum of 50 points. The grade distribution is given in the following stem-and-leaf plot: 1 2 3 4 5 9 225 013335889 00136679 02244478 The median grade is equal to: a) 30.5 b) 30.0 c) 25.0 d) 28.5 e) 44.0 www.ckl2.org 230 11) Ms. Davis conducted a survey of the 44 students in her stats classes and asked how tall each student is in inches. Here is the five-number summary of the students' data: {57,64,67,69,79} Approximately how many people are shorter than 64 inches tall? a) 8 b) 21 c) 22 d) 11 e) 18 12) In which scenario(s) would it be better to use the 5-number summary versus the mean and standard deviation? a) a graph that is skewed b) a graph that is fairly symmetric c) a graph that is symmetric but has several high outliers d) Both choice (b) and (c) e) Both choice (a) and (c) f) All of (a), (b) and (c) 13) Suppose the lowest score on an English exam was 35% and the highest score was 90%. If the teacher of the class was to examine her students' test scores, which type of distribution would she prefer to see? One that is. . . a) skewed to the right b) skewed to the left c) fairly symmetric d) none of the above 231 www.ckl2.org 14) Several people were surveyed as they were leaving a movie theatre. Among other things, they were asked how much money they had spent. They answers were: $14, $17.50, $16, $16, $19.25, $12.75, $16, $37.75, $13.50 and $17. It was later discovered that the person who answered "$37.75" actually spent $17.75. Which of the following would not change as a result? a) the box plot b) the mean & the mode c) the median & the mode d) the standard deviation e) they all change 15) What does the following five-number summary tell you about the shape of the distribution? {5, 7.7, 9, 10.9, 24} a) skewed to the right b) skewed to the left c) symmetric d) uniform e) cannot determine 16) According to the 1.5*(IQR) Criterion, what are the two cut-off values for determining whether the data set in question #15 contains any outliers? a) 5 & 24 b) 7.7 & 10.9 c) 11.3 & 29.9 d) 4.5 & 14.1 e) 2.9 & 15.7 www.ckl2.org 232 17) A class survey was conducted to determine students' preferences. One question regarded favorite sport to watch on TV. The results are as follows: 9 said "football"; 12 said "hockey"; 5 said "basketball"; 6 said "baseball" and 3 said "other". What would the central angle be for "hockey" in a pie chart of this data? a) 65° b) 123° c) 90° d) 34° e) 111° 18) The following two graphs are based from the US Census Bureau, 2008 ('per capita' means per person). The dots represent actual data values, and the red curves represent models that can be used to predict future trends. Study the two graphs and answer the questions that follow. Eased on the number of eel phones in the US (US Census Bureau, 2003) arid accounting for population the number of cell phones per capita, cp, can be modeled by the following graph: c ellphones per capita in the US 1 2 -, — - * Cel ph ones per capita (data) 1 Cel ah Dnes per capita (model) y^ *=1 o.s a. •n ui w 0.6 a. / > 3 0.2 / — • • ••« -0.2 - 1 ' i 1 i i : ,985 1 990 1.995 2,000 2,005 2.010 2.015 Year 2.02 0.65 6 0.5 0.45 - = 1)4 ™ O =15 0.3 ■ 0.2S Landlines per capita in the US ^ 2 1.985 1.990 1.995 Landlines [data) Landlines [ post-cellphonp modial) Landlines [pre-cellphone model) I i i , i | , 2.000 2,005 Yaar 2.010 2,015 2.02C 233 www.ckl2.org Source:http://www. rationalfuturist.com August 2, 2011. a) What type of graphs are these? b) Describe the trend that each graph shows separately. This should be in context. c) Notice that the horizontal scales are the same. Compare and contrast the trends that are shown in the two different graphs in context. d) Approximately how many cell phones were there per person in 1997? In 2005? How many will there be, if the trend continues as the model indicates, in 2018? e) Approximately how many landlines were there per person at the peak? What year did this occur? If the trend continues as the model indicates, how many landlines will there be per person in 2015? www.ckl2.org 234 19) The AHS Tornadoes and the BHS Bengals are big rivals! Every year students try to prove that their school is better at sports than the other school. The table below shows the number of points scored by each school's basketball team during the last 15 games played. Table 5.12: Tornadoes Bengals 58 74 90 81 71 73 64 63 58 58 63 84 60 92 72 38 48 77 59 84 72 95 62 66 57 70 64 68 49 72 a) Construct a back-to-back stem plot for the data. b) Calculate the five number summary, mean and standard deviation for both teams. c) Construct parallel box plots for the data. d) Compare the two distributions. This should be done in context and include at least three distinct comparisons. e) What other information would you like to know when comparing these two basketball teams? Explain. 235 www.ckl2.org 20) Here are the percents of the population that are of Hispanic origin in each state according to the 2010 Census. State Percentage State Percentage State Percentage AL 3.9 LA 4.2 OH 3.1 AK 5.5 ME 1.3 OK 8.9 AZ 29.6 MD 8.2 OR 11.7 AR 6.4 MA 9.6 PA 5.7 CA 37.6 MI 4.4 RI 12.4 CO 20.7 MN 4.7 SC 5.1 CT 13.4 MS 2.7 SD 2.7 DE 8.2 MO 3.5 TN 4.6 FL 22.5 MT 2.9 TX 37.6 GA 8.8 NE 9.2 UT 13.0 HI 8.9 NV 26.5 VT 1.5 ID 11.2 NH 2.8 VA 7.9 IL 15.8 NJ 17.7 WA 11.2 IN 6.0 NM 46.3 WV 1.2 IA 5.0 NY 17.6 WI 5.9 KS 10.5 NC 8.4 WY 8.9 KY 3.1 ND 2.0 DC 9.1 Source:http://www. census. gov a) Construct a histogram. b) Calculate the five number summary. c) Identify any outliers. Use the outlier test. d) Accurately sketch a box plot. e) What is the range? The IQR? The mode? f) Calculate the mean and standard deviation. g) Compare the mean and the median, (i.e. which is larger? How different are they?) h) In this case would the 5#-summary or the mean & standard deviation be more appropriate? Why? i) Describe the distribution. Be thorough! Don't forget your S.O.C.C.S! (shape, outliers, center, context, & spread) j) According to the Census data, the percent of the total population of the United States that is Hispanic or Latino is 16.3%. Where does this number fit in the distribution of all of the states? Does it seem surprising? Why or why not? www.ckl2.org 236 21) An employer in Minneapolis was interested in determining how much money his employees were spend- ing on parking each week. An SRS of 50 employees was selected to complete a questionnaire about parking. Several questions were asked including where they park, how much they spend per week, how often they have difficulties finding spots, if they pay daily, weekly, or monthly, etc. The following table is the average weekly expenditure for parking for this sample of 50 employees. 20 40 22 22 21 21 20 10 20 20 20 13 18 50 20 18 15 8 22 25 22 10 20 22 22 21 15 23 30 12 9 20 40 22 29 19 15 20 20 20 20 15 19 21 14 22 21 35 20 22 a) Construct a split-stem plot b) Calculate the five number summary. c) Identify any outliers. Use the outlier test d) Accurately sketch a box plot, (to scale with labels) e) What is the range? The IQR? The mode? f) Calculate the mean and standard deviation. g) Compare the mean and the median. h) In this case would the 5#-summary or the mean & standard deviation be more appropriate? Why? i) Describe the distribution. Be thorough! Remember your S.O.C.C.S! Image References: Gasoline. http://www.education.vic. gov.au Cars. http://www.icoachmath.com School Lunch pictogram.http://alex.state.al.us Dot plot. http://cwx.prenhall.com Stem plot example. http://www.basic-mathematics.com Shapes of distributions. http://thesocietypages.org Weight of Jessica graph, http://www.stat.psu.edu 237 www.ckl2.org Crowded stem plot, http://illuminations.nctm.org/ Three histogram example, http://classes.cec.wustl.edu Package weight graph, http://flylib.com www.ckl2.org 238 Chapter 6 Analyzing Bivariate Data 6.1 Displaying Bivariate Data &5 Introduction In chapter 5 we learned how to analyze and describe univariate, or single-variable data. We explored ways to present our data visually with graphs and charts and how to analyze our data with numerical statistics. Also, we described our findings verbally and in context. Now we will be analyzing bivariate numerical data. This means two numerical values collected about each individual. Such bivariate data is often given in a table, or can be listed as ordered pairs. We will construct appropriate graphs, calculate numerical statistics and equations, and describe the relationship between the two variables in context. The purpose will be to explore whether or not a relationship or association exists between the two numerical variables. If an association does exist, statistics can be used to predict one variable based on the other variable. Learning Objectives • Construct and interpret scatterplots • Identify explanatory and response variables • Describe bivariate distributions in context — including strength, outliers, form and direction 239 www.ckl2.org Scatterplots ***** Scatterplots are graphs that represent a relationship between two variables. Two numerical values are measured about each individual being studied. When these two values become ordered pairs that are graphed on a coordinate plane, the resulting graph is called a scatterplot. We often suspect that one of these variables might explain, cause changes in, or help to predict the other variable. The explanatory variable is the variable that we believe may explain or affect the other variable. The explanatory variable is plotted along the x-axis. The response variable is the variable we believe may respond to, or be affected by, the other variable. The response variable is plotted along the y-axis. The response variable is often referred to as the dependent variable and the explanatory is referred to as the independent variable. Even though we often look for an explanatory-response relationship between the two variables, we can create a scatterplot even if no such relationship exists. J2. to 1 Of C o Q. 50 40 30 20 10 Scatterplot <^r + +♦ — i — 20 - 1 — 40 - 1 — 60 - 1 — BO 100 Explanatory Variable Example 1 State whether or not you suspect that there will be an explanatory-response association between each of the following pairs of data. If yes, identify the explanatory and response variables. a) A college professor decided to examine whether or not there is a relationship between the amount of time that a student studies and his or her score on the mid-term exam. At the end of the exam each student was asked to record the number of hours he or she had spent studying for the mid-term. The professor then made a scatterplot to examine the data. b) A different professor wanted to see whether or not there is an association between her students' heights and their IQ scores. She gave each of her students an IQ test and had her TA measure each student's height to the nearest inch. She constructed a scatterplot to examine the data. Solution a) It is reasonable to believe that the amount of studying does somehow have an effect on www.ckl2.org 240 students' exam scores. The explanatory variable is hours studying and the response variable is exam score. Often thinking in terms of a cause and effect relationship can help identify which variable is which. As a hint, try to determine if one of the variables comes first. If one comes first, then is most likely the explanatory variable. In our example, studying should come before the exam. b) It is not reasonable to believe that there is an association between height and IQ scores. Neither of these variables comes before the other and neither would be useful in predicting the other. However, even though we do not believe that there is an explanatory-response relationship between these variables, we can still construct a scatterplot Example 2 The following table reports the recycling rates for paper packaging and glass for several individual countries. It would be interesting to see if there is a predictable relationship between the percentages of each material that countries recycle. Construct a scatter plot to examine the relationship. Treat percentage of paper packaging recycled as the explanatory variable. Country A of Paper Packaging Recvcled /(■of Glass Packaging Recycled Estonia u 64 New Zealand 40 72 Poland 40 27 Cyprus 42 4 Portugal 56 39 United States 59 21 Itak 62 56 Spain 63 41 Australia 66 44 Greeee 70 34 Finland 70 56 Ireland 70 55 Netherlands 70 76 Sweden 70 100 France 76 59 Germany *3 Hi Austria 83 44 Belgium S3 m Japan 98 96 Figure: Paper and Glass Packaging Recycling Rates for 19 countries Solution We will place the paper recycling rates on the horizontal axis because we are treating it as the explanatory variable. Glass recycling rates are then plotted along the vertical axis. Next, plot a point that shows each country's rate of recycling for the two materials. Be sure to label your axes. 241 www.ckl2.org 100- o 80- o c 60- O ° e o 40- o °° • o o 20- o .■-■. • u ' — i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 30 35 40 45 50 55 60 05 TO paper 75 SO 05 90 95 100 Notice that we do not always need to start at zero on either axis when making scatterplots. Describing Bivariate Data When we describe single variable data, we address several characteristics. We used the acronym S. O. C. C.S. to help remember to describe the shape, outliers, center and spread of a distribution. And, to be sure to do all of this in the context of the variables and individuals being studied. For bivariate data, we will again be discussing several characteristics in context. The important characteristics to describe when looking at the relationship between two numerical variables will be strength, outliers, form and direction. And, we will do this in the context of the variables and individuals being compared. The acronym that will help us to remember what to include in our descriptions is: S.C.O.F.D. (strength, context, outliers, form and direction) . When looking at a scatterplot, it is helpful to imagine drawing a line-of-best-fit through the data. A line- of-best-fit is a line that follows the trend of the data. It may go through some, all, or none of the actual points on the scatterplot. Do not actually draw such a line on your plot- just try to determine whether or not such a line would make sense, and if so, where it would fit. As you observe a scatterplot and imagine drawing such a line, you can ask yourself questions such as: How close to a line do the points lie? Would a curved pattern fit better? Are there points that would be far away from the line? Would the line have a positive or negative slope? etc. Strength Once you have constructed a scatterplot, you can examine the strength of the relationship between the two variables. The strength refers to how closely the points form a pattern. The more closely the points fit a pattern, the stronger the relationship between the variables. The more spread out and scattered the points are, the weaker the relationship. The first plot shows an extremely strong, linear pattern because the points form an obvious line. The second plot is more scattered so it is only moderately strong. And, www.ckl2.org 242 the third plot does not show much of a pattern at all, so it is very weak. Keep in mind that the association may be very strong, but not linear. We could find a very clear curved pattern in the data, for example. In the next section of this book we will learn about a statistic, called correlation, that measures the strength of the linear relationship between two variables. STRENGTH a • t ■" ■ m 9 t> e 1 '.- • * 1 V M * o * D as u » « m s» to tE. tc ?t m as « » tea VERY STRONG MODERATELY STRONG VERY WEAK In example #^ very weak. the relationship between paper and glass recycling rates for these countries is Context Do not forget that the graph, the numbers and equations, and the descriptions are all about something-its context. All of these elements should be described in the context of the variables and the individuals being examined. These graphs and statistics are not meaningless, they are about something! In example #4, the scatterplot explores the relationship between glass and paper recycling rates for several countries.. Outliers When examining a scatterplot, look for any data values that do not fit the pattern, or points that stand out from the rest of the data. An outlier will be a point that lies away from the rest of the data or one that seems to affect the strength of the relationship between the two variables. Many outliers will weaken the association between the variables, but they often would not significantly change where a line-of-best-fit would be drawn. An influential point is an outlier that actually seems to influence the line-of-best-fit. Imagine what the plot would look like without the point in question. If it would change the strength, then the point is an outlier. If it would change the slope of a line-of-best-fit, or where the line would be drawn, then the point is influential. 243 www.ckl2.org OUTLIERS 70 _ 60 — 50 - ■ 40 _ • • au _ ■ • ■ • ?n - • 10 _ • • * - • 1 1 1 1 1 1 1 1 10 12 14 10 12 14 OUTLIER & INFLUENTIAL OUTLIER (but not influential) In example #4, there seem to be some outliers. For example, Estonia and New Zealand have much lower paper recycling rates than their glass rates. Without these data values, the relationship would be stronger. Form **** Many scatterplots show a clear form or pattern. The first plot below shows a clearly linear. It is easy to imagine drawing a line-of-best-fit through these points. The second plot shows a clearly curved form. A line would not make any sense, so this is non-linear. The third plot shows a great deal of scatter among the points, so it has no form whatsoever. www.ckl2.org 244 FORM » 25 H 30 15 10 - 5 - I I I I » F + • :n ■ * * * * LINEAR NONLINEAR NO FORM In example #4, the scatterplot for paper and glass recycling rates shows a very weak linear form. The relationship is very weak, but no curved pattern is visible. If the outliers were removed, it would become more linear. Direction The direction of the graph is also important to mention. A graph that goes down to the right has a negative association. That is, as the explanatory variable increases, the response variable decreases. The first plot below has a negative relationship between the variables. A graph that goes up to the right has a positive association. That is, as the explanatory variable increases, the response variable also increases. The second plot shows a positive relationship between the variables. The third plot is an example of a graph that has neither a positive, nor a negative direction. If the relationship is linear and a line-of-best-fit is added to the graph, the slope of the line will be positive if the association is positive. And, the line will have a negative slope if there is a negative linear association between the two variables. DIRECTION « • NEGATIVE POSITIVE NO DIRECTION In example #4, the scatterplot for paper and glass recycling rates shows a positive association. As the paper recycling rate for these countries increases, so does the glass recycling rate. 245 www.ckl2.org Example 3 The following example is a scatterplot showing the weights and gas mileage for several cars. a) Identify the explanatory and response variables. b) Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.). Car Weight and Gas Mileage Gas Mileage • 26- ■ • • • 24- 22- • • 20- • • 18 • • * 3000 3500 4000 Weight of a Car Solution ***** n a) explanatory variable is: weight of the cars response variable is: gas mileage of the cars b) The relationship between these vehicles' weights and gas mileage is strong and very linear. There are no extreme outliers visible in the graph. The association between a vehicle's weight and gas mileage is negative. As the weight of the vehicles increase, the gas mileage of the vehicles decrease. www.ckl2.org 246 Example 4 The following scatterplot shows the data collected by the professor who wanted to see whether or not there is an association between her students' heights and their IQ scores. She gave each of her students an IQ test and had her TA measure each student's height to the nearest inch. Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.). Height vs. IQ 7S 73 « 65 § SO 41 ss 53 ±+± . *— ♦ ♦ • ♦ ♦ ♦ ♦ 80 90 iOO 110 120 130 I Q Score Solution There appears to be no relationship between height and IQ scores for these students. The graph has no form an no direction. Therefore, there are no outliers. The relationship has zero strength. There is no pattern or trend between IQ scores and students ' heights. 247 www.ckl2.org Section 6.1 Exercises 1) State whether or not you suspect that there will be an explanatory-response association between each of the following pairs of data. If yes, identify the explanatory and response variables. a) The number of semesters that students have been enrolled in college and the number of credits that they have earned. b) Students' grades on a statistics test and their weights. c) Employees' annual salary and the number of years that they have been employed by the company. d) The number of songs each person has on his or her IPod and the number of months that they have owned the IPod. 2) A college professor decided to examine whether or not there is a relationship between the amount of time that a student studies and his or her score on the mid-term exam. At the end of the exam each student was asked to record the number of hours he or she had spent studying for the mid-term. The professor then made a scatterplot to examine the data. Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.). Hours of study vs. Test scores 100 90 80 $ 7 ° 3 60 en E- 50 40 30 20 10 **_^ -^ +_ 5 10 15 20 25 30 35 Hours of study 3) Malia turned the water on in her bathtub full blast. She then measured the depth of the water every two minutes until the bathtub was full (and her mother started to freak out). Her findings are listed in the following table. www.ckl2.org 248 Time (minutes) Depth (cm) 2 7 4 9,5 6 ^M 14 8 19.5 10 21 12 24 14 ^M 32 16 36 is ^M 37.5 20 41 22 46 a) Identify the explanatory and response variables for this situation. b) Construct a scatterplot to show the results. c) Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.). 4)Several brands of peanut butter were rated for quality. The following graph compares the price per ounce and the quality rating for each of these brands of peanut butter. 249 www.ckl2.org Peanut Butter 35 — 30 — 25 — m 20 — <u CL 15 — 10 — 5 — — • • i i i i i i i i i 10 20 30 40 SO 60 70 80 90 Quality a) Identify the explanatory and response variables for this situation. b) Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.). 5) Mr. Exercise wanted to know whether or not customers continued to use their equipment after they purchased it. He contacted an SRS of his customers who had purchased an exercise machine during the past 18 months. His findings are summarized in the following table: www.ckl2.org 250 # months owned machine # hours exercise per 1 8 5 4.5 7 3 4 6 9 2 14 1.5 5 7 11 4 3 6.5 6 4 a) Identify the explanatory and response variables for this situation. b) Construct a scatterplot to show the results. c) Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.). 6) The following scatterplot shows the elevation and mean temperature for various location in Nevada. 251 www.ckl2.ors Nevada 25 n a 20- • • ■ E • • * , _ • H 15 - ■ P5 • • 3 •■p c 10 - < c n v 5 - • • ft > S n u [ ) 500 i i i 1000 1500 2000 Elevation (m) 2500 a) Identify the explanatory and response variables for this situation. b) Describe what the scatterplot shows. Be sure to address strength, context, outliers, form and direction (S.C.O.F.D.). www.ckl2.org 252 6.2 Correlation Learning Objectives Understand the properties of the linear correlation coefficient Estimate and interpret linear correlation coefficients Understand the difference between correlation and causation Identify possible lurking variables in bivariate data Understand the effects outliers and influential points can have on correlation The Correlation Coefficient The correlation coefficient is a statistic that measures the strength and direction of a linear relationship between two numeric variables. The symbol for correlation is r, and r can take any value from -1.0 to +1.0. The correlation coefficient (r) tells us two things about the linear relationship between the two variables, its strength and its direction. The direction of the relationship, positive or negative, is given by the sign of the r value. A positive value for r indicates that the relationship is positive (increasing to the right), and a negative r value indicates a negative relationship between the two variables (decreasing to the right). Bivariate data with a positive correlation tells us that as the explanatory variable increases, so does the response variable. And, bivariate data with a negative correlation tells us that as the explanatory variable increases, the response variable decreases. A correlation of zero indicates neither of these trends. The second thing that the correlation coefficient tells us is the strength of the linear relationship - how close the points are to forming a perfect line. An r of exactly 1 or -1 has a perfect correlation, the relationship forms a perfect, exact line. An r value of exactly +1 means that the relationship forms a perfect line with a positive slope and a r value of exactly -1 means that the scatterplot will show a perfect line with a negative slope. The closer the correlation value is to either +1 or -1, the stronger the linear relationship is. And, as r gets closer to zero (either positive or negative), the weaker the linear relationship is. It is important to note that this is only measuring the linear relationship between the two variables. If the relationship shows a clear curved pattern for example, the correlation will tell us nothing about the strength of the relationship. Here are some sample scatterplots with their correlation coefficients given: 253 www.ckl2.org r = -1 r = -.900 r = -0.536 r = -.160 20 15 ♦ ♦ 5 UO .m 5 5 ■ a. o in ♦ ♦ ♦ Z 4 5 i Explanatory variable 40 | 30 jj 25 + n, 20 1 " 3 » ■ 5 ♦ ♦ C 2 4 6 ExptanatDry Variable- a r = +0.S03 r = +1 We will be using either our calculator or a computer to calculate the correlation coefficient. The formula to calculate the correlation coefficient is quite tedious. It involves calculating the mean and standard deviation of all of the x- values and the mean and standard deviation of all of the y- values. It then compares the x- value of each ordered pair to the mean of x and every y- value to the mean of y (by subtracting and then dividing by the standard deviation), multiplies these newly calculated values, adds all of them, and divides by one less than the sample size. The correlation formula is shown below, but we will be using technology rather than calculating by hand. www.ckl2.org 254 g** Correlation Coefficient Formula: (xj - x)\ f(y t - y)' Z fcfi) ( . n — 1 Example 1 Estimate the correlation coefficient for each of the following scatterplots. Nevada 25- a. E £ IS - !»■ c I ■■ * * • ••* * • • n . i 1 500 1QCD 1500 2000 EltV3tt&n fin> 250D Height vs. IQ *,♦* + ■^■i * * .♦ •r */.. * * +* * * w- - BO 90 IM J10 120 lid Qfan Solution Nevada: The correlation will be negative and fairly strong, so my estimate is r * -0.85. Height & IQ: There seems to be no pattern to the graph, so my estimate is r « 0. Properties of Correlation When considering using correlation as a measure of the strength between two variables, you should construct and examine a scatterplot first. It is important to check for outliers, be sure that the relationship appears to be linear, be sure that your sample size is sufficient, and consider whether the individuals being examined 255 www.ckl2.org were too much alike in some way to begin with. Thus, when examining correlation, there are four things that could affect our results: outliers, linearity, size of the sample and homogeneity of the group. An outlier, or a data point that lies outside of our overall pattern, can have a great effect on correlation. How great of an affect is determined by the sample size of the data and by the magnitude by which the outlier lies outside of the pattern. The three plots below show scatterplots with their correlation coefficients (r). The first plot shows a positive and reasonably linear graph. Its correlation is r = .897, which is positive and fairly strong. The second plot shows the same data as plot one, with one outlier (upper left) added. Its correlation has dropped to r = .374, which is still positive, but much weaker. This demonstrates how outliers can bring the correlation closer to zero. However, some outliers can actually strengthen the correlation. This is demonstrated in the third plot, which shows the same data as the first with one outlier (upper right) added. With this outlier, the linear relationship becomes even stronger than the first plot, at r = .973. r = 0.897 r = 0.374 r - 0.973 i p ! D ° D IB ° i ° p D U ° !■.-.... 1; □ □ | ■ ■ " If the relationship is not linear, calculating the correlation coefficient is meaningless. It is only testing the linear relationship between the two variables. Imagine a scatterplot that shows a perfect parabolic relationship. We would know that there is a strong relationship between these two variables, but if we calculated the correlation coefficient, we would arrive at a figure around zero. Therefore, the correlation coefficient is not always the best statistic to use to understand the relationship between variables. As we discussed in experimental design, a small sample size can be misleading. It can either appear to have a stronger or weaker relationship than is really accurate. The larger the sample, the more accurate of a predictor the correlation coefficient will be on the linear relationship between the two variables. When a group is too much alike in regard to some characteristics (homogeneous), the range of scores on either or both variables is restricted. For example, suppose we are interested in finding out the correlation between IQ and salary. If only members of the Mensa Club (a club for people with IQ's over 140) are sampled, we will most likely find a very low correlation between IQ and salary since most members will have a consistently high IQ, but their salaries will vary. This does not mean that there is not a relationship - it simply means that the restriction of the sample limited the magnitude of the correlation coefficient. Lurking Variables It is very important to know that a high correlation does not mean causation! Often times studies that showing a high correlation between two variables will influence readers into thinking that one variable is the cause of the relationship. This is not always true! A high correlation simply does not prove that one www.ckl2.org 256 variable is causing the other. In some situations we would agree that one variable is in fact causing the response in another. The best way to prove such a direct cause-and-effect relationship is by carrying out a well designed experiment. For example, smoking is strongly correlated with lung disease, and, based on much scientific evidence, we can now say that cigarette smoking causes lung disease. However, this topic was highly debated for many years before the surgeon general announced that it was accepted that cigarette smoking causes lung cancer and emphysema. Many people refused to accept this for many years. People who stood to lose money if smoking was proven to be unsafe, suggested every possible other explanation that they could think of. They suggested that it was simply a coincidence, or that all people who choose to smoke might have something else in common that was actually the cause of the lung disease, not the cigarettes. Because it was not ethical to experiment on humans in order to prove the direct cause-and-effect relationship, the debates went on for a long time. Congress mandated that the Surgeon General's warning labels appear on all cigarette packaging sold in the U.S. beginning in January 1966. Since 1972, the Surgeon General's warning labels have appeared on U.S. cigarette advertising as well. The Surgeon General's warnings required by the Cigarette Labeling and Advertising Act of 1965 have been amended over time, and the Act currently requires cigarette manufacturers and importers to print the following warnings, which are rotated on a quarterly basis, on cigarette packaging and advertisements: SURGEON GENERAL'S WARMING: Smoking Causes Lung Cancer. Hean Disease, E mphyse ma. And May Complicate Pregnancy. "H l~" I SURGEON GENERALS WARNING. Quilting Smoking Now Greatly Reduces Serious Risks to Your Health. .J L. ~i r l i I i ! I I J l__ SURGEON GENERAL'S WARNING: Smoking By Pregnant Women May Result in Fetal Injury, Premature Girth, And Low Birth Weight. ~i r- I SURGEON GENERAL'S WARNING: Cigarette Smoke Contains Carbon Monoxide. Sometimes the relationship between variables is a cause-and-effect one, but many times it can be simply a coincidence that the two variables are highly correlated. It is also possible that some other outside factor, a lurking variable, is causing both variables to change. A situation where we have two variables that are both being affected by some other, outside, lurking variable is called common response. For example, we can show a high correlation between the number of TV's per household and the life expectancy per person among many countries. However, it makes no sense that TV's cause people to live longer. Some lurking variable is having an effect here. It is likely that the economic status of the countries is causing both variables to change: more money means more TV's and more money means better health care. If a country is wealthy it is much more likely to have citizens who own TV's. Also, if a country is wealthy it is much more likely to have good hospitals, roads, health education, access to clean water and food, all things that contribute to longer life. In some situations we will have two variables that are highly correlated, but we are unsure of the exact cause of the relationship. We may be unclear as to whether or not one is causing the other, if there is a lurking variable causing a common response, or if there is some unknown lurking variable that is related in some other unknown way (lurking variables are not always obvious to the researchers). Such a situation is called confounding, because it is confusing to determine how the variables are related (if at all), and whether there may be some lurking variable and if it is related to the variables in question. The variables seem all mixed up and the relationship is unclear, even if highly correlated. An example of confounding is global warming. This is a highly debated topic in social media and web-blogs. Some people argue that human pollution is a major cause of the increase in CO2 and other green house gasses in the atmosphere. While others argue that it is a part of a natural cycle that has normally occurred in our Earth's history. Still some may think both explanations are at work. This is an example of confounding because there is confusion about the cause of global warming. 257 www.cki2.0rg And don't forget that some relationships are occurring completely by chance, and their high correlation is then just a coincidence. For example, if you researched divorce rates and gas prices over the past 50 years you may note that both have gone up. A scatterplot comparing divorce rates and gas prices would show a strong positive relationship. The correlation would likely be a high, positive value. However, it makes no sense that divorce rates are causing high gas prices. It also is unlikely that there exists a common response or some form of confounding. So in this case, we would say that this is a relationship that is best explained by sheer coincidence. Example 2 Suggest possible lurking variables to explain the high correlations between the following variables. Explain your reasoning. Consider whether common response, confounding, or coincidence may be involved. a) It has been shown that cities with more police officers also have higher numbers of violent crimes. Does this mean that more police officers are causing more violent crimes to occur? b) Over the past 25 years, the percent of parents using car-seats has increased significantly. During this same time period, the rate of DUI arrests has also increased significantly. These two variables, when graphed, show a very high, positive correlation. Does this mean that car-seat use is causing DUI's to increase? c) A study published in USA Today claimed that, "Teens who text a lot [are} more likely to try sex, drugs, alcohol." Does this mean that texting causes teens to try sex, drugs and alcohol? Could we then limit teen behaviors such as these by canceling their texting plans? Solutions a) It makes no sense that the number of police officers would be causing the violent crime to occur. It is much more likely that it the reverse, that communities with high numbers of violent crimes need higher numbers of police officers. It is also probable that both variables increase in cities with higher populations. Due to the fact that we can think of more than one possible lurking variable and it is difficult to know how all of these variable actually relate, we would say that this is an example of confounding (the variables in question and the lurking variables are all mixed up). b) It is clearly ridiculous to think that car-seat use is causing an increase in the rate of DUI's. It also makes no sense that DUI's cause car-seats to be used. It may be simply a coincidence that these are both increasing. Or, perhaps there has been in increase in law enforcement for both over this time period. The awareness of the dangers of both have increased over the past 25 years, so maybe this is an example of common response. Or, maybe many factors contribute to the increase of both, so perhaps this is an example of confounding. But, no matter what, this is not cause-and-effect. c) It is unlikely that texting is actually the cause of these behaviors. There is most likely some other, lurking variable(s) that are the cause(s). One probable lurking variable, when it comes to teenagers, is the parents. Perhaps this is an example of a common response to parents who are not very involved in their teens' lives. Parents who are not very involved would not be aware that their teen is texting too much and would also not be aware of what choices their teen is making during his or her free time. Perhaps teens who spend a lot of time unsupervised would be more likely to text and would also be more likely to try sex, drugs, and alcohol. All of www.ckl2.org 258 these behaviors might be a common response to not having parents who prohibit or limit teens from doing these things. Canceling texting plans would have little to no affect on other teen behaviors. See the link for more information on this report at: http://www.usatoday.com/yourlife/sex- relationships/2010-ll-10-texting-teens N.htm Multimedia Links: Calculating Correlation on the Internet, There are several websites where you can enter in data points and find their correlation one of them is below. http://easycalculation.com/statistics/correlation.php If this site no longer works, trying googling "finding correlation applet" and see what you get for results. For an explanation of the correlation coefficient, see kbower50, The Correlation Coefficient (3:59). Another, more lighthearted example of Correlation + Causation can be found at the following website, which discusses the evil of the pickle. http://www.exrx.net/ExInfo/Pickles.html For a better understanding of correlation try these fun links below, http://www.istics.net/stat/Correlations Match the graph to its correlation. http://www.rossmanchance.com/applets/guesscorrelation/GuessCorrelation.html Guess the cor- relation Guess the correlation 259 www.ckl2.org Section 6.2 Exercises 1) What are the two things that the correlation coefficient measures? 2) The program used to create this scatterplot found the line-of-best-fit and reported the r-squared value as r 2 = 0.805 for the relationship between arm-span and height for several individuals. What is the correlation coefficient? Is it positive or negative? Explain how you know. 185.0' 180.0- 175.0- E 170.0- a ■ 165.0- 160.0' 150.0' R Sq Lineal = 0.805 160.0 180.0 arm-span (cm) 3) During the summer Ms. Statsteacher lets her two daughters stay up later than during the school year. Their bedtimes during the summer range from 8:30 p.m. to 12:30 a.m. She has discovered that her older daughter Reily will wake up between 8:00 and 9:00 a.m. no matter what time she goes to bed. However, her younger daughter Neila tends to wake up later after she gets to stay up later, and earlier when she goes to bed earlier. Neila has been known to wake up anytime between 8:00 and 11:45 a.m. a) Sketch a separate (approximate) scatterplot for each daughter, that compares time going to sleep and time waking up. Which will be explanatory and which will be response? b) Which of these do you think will best approximate the correlation for Reily? A. close to r = +1 B. close to r = + .75 C. close to r = D. close to r = -.75 E, close to r = -1 c) Which of these do you think will best approximate the correlation for Neila? www.ckl2.org 260 A. close to r = +1 B. close to r = +.75 C. close to r = D. close to r = -.75 E. close to r = -1 4) Suggest possible lurking variables to explain the high correlations between the following variables. Explain your reasoning. Consider whether common response, confounding, or coincidence may be involved. a) As ice cream sales increase, the rate of drowning deaths increases sharply. Does this mean that ice cream causes drowning? b) With a decrease in the number of pirates, there has been an increase in global warming over the same time period. Does this mean global warming is caused by a lack of pirates? c) The higher the number of fire-fighters fighting a fire, the more damage done by the fire. Does this mean that we can limit damage by sending fewer fire- fighters to fires? d) Suppose that each of the hockey players on the high school team supplies his or her own hockey stick, with varying degrees of flex. The assistant coach has been keeping a record of the degree of flex for each player's stick and their respective point totals (goals and assists). He has noted that there is a strong, negative correlation between these two variables. In other words, the players with less flex in their sticks are scoring more points and those with more flex are scoring fewer points. Can we then give players less flexible sticks and expect to increase scoring? 5) In a recent study in Resource Manual, it was noted that divorced men were twice as likely to abuse alcohol as married men. The authors concluded that getting divorced caused alcohol abuse. Do you agree? Explain your reasoning. 6) A commercial for a new diet pill claims "You will lose weight while you sleep! No exercise needed!". They then show several before-and-after photos of people who have lost weight. People who were obese are now very buff. They then give the information for you to order the pills ("for three payments of just $19.95 each, plus shipping and handling"). Is this proof that these diet pills caused these people to lose weight? Suggest possible lurking variables. Explain your reasoning. 7) Match each graph with its correlation coefficient: 261 www.ckl2.org GRAPH#1 GRAPH #2 GRAPH #3 GRAPH #4 GRAPH #5 Match the Correlation with the graph: A. r = 0.941 B. r = 0.850 C. r = 0.321 D. r = -0.598 E. r = -0.938 8) A correlation of r = indicates no linear relationship between the two given variables. But, this does not mean that there is no relationship between the two variables. Sketch a scatterplot in which there is a strong relationship between the variables, but the correlation would be near r = 0. 9) Use the "Beach Visitors" scatterplot to answer the questions that follow. www.ckl2.org 262 Beach Visitors 600 ■ 525 - B 450- o 375- S 300 - > 225, 150- 75- i | *• • | 4 V | • W !! A 1^ < • 1 ■- 80 84 88 92 96 Average Daily Temperature (*F) a) Identify the explanatory and response variables. b) Estimate the correlation coefficient for the graph. c) Describe what the scatterplot shows, (remember S.C.O.F.D) 263 www.ckl2.org 6.3 Least-Squares Regression 'm Learning Objectives Construct scatterplots using technology Calculate and graph the least-squares regression line using technology Calculate the correlation coefficient using technology Use the LSRL to make predictions Understand interpolation and extrapolation Interpret the slope and the y-intercept of the LSRL Least-Squares Regression In the last section we learned about the concept of correlation, which we defined as the measure of the linear relationship between two numerical variables. We saw that when the points of a scatterplot formed a clear linear pattern, then the points were said to have a high correlation. As a reminder, we can have a strong correlation in either a positive (increasing to the right) or a negative (decreasing to the right) direction. We have also discussed the idea of drawing a line-of-best-fit through the data. In some scatterplots this is easy to do and all of us would end up with our lines in nearly the same place. However, if everyone were to simply draw a line where they think it fits or to select two of the points to calculate a line through, our lines and equations would certainly vary from person to person. Therefore, we will use a specific formula to calculate the equation for the line-of-best-fit. Linear regression involves using data to calculate a line that best fits the data and then using that line to predict scores. We will use the Least-Squares Regression Line (LSRL)- the line that makes the sum of the squares of the vertical distance of each data point from the line the least possible. This is the standard regression equation that is used most often. It is the one that your graphing calculator and Excel will calculate for you. The formula and process to calculate this is quite tedious, so we will use technology to find the LSRL equations. The regression equation will be in the form of: y = a + bx, where a is the y-intercept and b is the slope of the equation. Your calculator will calculate the correlation coefficient (r) at the same time as it calculates the LSRL equation. It will also report a value for r 2 (which is exactly what it says; r-squared). The r 2 value is called the coefficient of determination, but we will not be addressing its importance in this course. To calculate the LSRL equation and correlation coefficient, use a graphing calculator or computer program. See the appendix at the end of this book for the steps to calculate the LSRL and correlation. www.ckl2.org 264 Least- ■Sqi jares Regression Equation y — a + bx X = = the explanatory variable y = the predicted response variable a =the y- intercept (or the value ofy, when x = 0) b = the slope {or the rate of change in yfor each increase of one unit in the x direction) Interpreting the slope and y-intercept As with all of our statistics, these data, graphs and equations are not meaningless. They represent the relationship between two numerical values measured on several specific individuals. Thus the slope and the y-intercept of our newly calculated regression equation mean something as well. So, we will be interpreting both in context. The interpretation of the slope of the regression equation is the rate of change in the response variable (y) , for each increase of one unit of the explanatory variable (x) . You will say something like: For each increase of one (explanatory variable), there will be (an increase or decrease) of (slope value) in the (response variable). The interpretation of the y-intercept of the regression equation is the value of the response variable (y) when the explanatory variable (x) is zero. You will say something like: When (explanatory variable ) is zero, the (response variable) is (y-intercept value). You will discover that the interpretation of the y-intercept often makes absolutely no sense when put into context. This is because actual data rarely involves x- values of zero. 265 www.ckl2.org Example 1 Below is data given by a canine expert. It relates a dog's age in years to what they believe the equivalent age in human years to be. Dog Age (in Years) Equivalent Human Age (in Years) 0.5 5.5 1 10.5 2 19 3 24 4 29 5 33 7 41 8 45 10 53 11 57 The scatterplot showing this data, using dog age as the explanatory variable, is below. go JJ 50 C ra 40 E t 30 2C C _u ra .3 io D IT hi O Dog Years! ♦_! 4^ ^-* ♦ — ♦ 4 6 3 Dog Age in Years 10 12 www.ckl2.org 266 a) Calculate the Least-Squares regression line for the Dog Year Data. Report you equation. Be sure to identify your variables. b) Calculate the correlation (r). What two things does r tell us about this relationship? c) Identify and interpret the slope in the context of the problem. d) Identify and interpret the y-intercept in the context of the problem. Solution Dog Years! y = 7.7947 + 4.6418* R 2 = 0.9815 Dog Age in Years 12 aJThis was done using Excel, but the graphing calculators will report the same LSRL. LSRL is: y = 7.795 + 4.642* x = Dog age in years y = equivalent human years (predicted) b) r will be the square-root of r 2 (The graphing calculators report both r and r 2 so you would not need to do any calculating, but Excel only gave r 2 ). r = V^ 2 = V0.9815 = 0.9907 The two things that r tells us are: Because r is positive, this relationship is increasing. And r is very close to one, so this relationship is very strong. c) The slope is 4-642. It means that for every increase of one year in dog age, there is an increase of 4-642 years in the equivalent human age. d) The y-intercept is 7.795. It means that if a dog were zeros old, it would be 7.795 years in human years. (This is clearly nonsense in this case!) ***** 267 www.ckl2.org Making predictions The main use of the regression line is to predict values. After calculating this line, we are able to predict values by simply substituting a value for the explanatory variable (x) and solving the equation for the predicted response value (y). In our example above, we can predict that the human year equivalence for a dog that is 6 years old is approximately 35.6 human years (see equation below). This prediction is reasonable and it matches with our graph. However this is not always the case. y = 7.795 + 4.642(6) = 35.647 As you look at the LSRL drawn on the above scatterplot, you can see that the points to the far left do not appear to be very linear. So, using the line to the left of about 1 year will not make much sense. Also, we do not have any idea what will happen to the data beyond the 11 years that we have recorded. An LSRL is very useful in making predictions, but only within the range of the actual data that we have collected and can see- this is called interpolation. We can see that this line is a reasonably good fit between 1 and 11 dog years, but we simply do not know what happens beyond 11 years (and we cannot use negative years for obvious reasons). The prediction line that we have calculated will go forever in both directions (remember geometry?), but it will not be appropriate to use it to predict for all values of x. Using a regression line to predict values that are outside the range of our actual data is called extrapolation. Extrapolation will often yield ridiculous answers! However, even if the result seems reasonable, we should avoid extrapolating because we simply do not know what happens beyond our actual observations. Making decisions based on extrapolating can be dangerous as we are coming to conclusions that are not backed up by data. Example 2 The following table lists the GPA and Verbal SAT Score for seven students. Analyze how well Verbal SAT Scores can be used to predict students' GPAs based on this data. Student Verbal SAT Score GPA Anna 595 3.4 Bryce 520 3.2 Corbin 715 3.9 Delia 405 2.3 Emilio 680 3.9 Frankie 490 2.5 Geraldine 565 3.5 a) Construct a scatterplot on your graphing calculator (or computer). Sketch the graph that the calculator shows. Be sure to label your axes. b) Calculate the Least-Squares Regression Line (LSRL) using your calculator. Report your equation. Be sure to identify your variables. www.ckl2.org 268 c) Calculate the correlation coefficient (r). Report it here. What are the two things that this number tells us about this graph? d) Identify and interpret the slope in the context of the problem. e) Using your equation, what is the predicted GPA of a student who has a Verbal SAT Score of 500? Of a student with a score of 600? Solution a) Construct a scatterplot on your graphing calculator (or computer). Sketch the graph that the calculator shows. Be sure to label your axes. Here is the scatterplot from a TI-84 plus: < a. Verbal SAT Here are the LSRL, correlation, and the scatterplot with the line added to the graph, from a TI-84 plus: LinReg y=a+bx a=. 0974125533 b=. 095546124 r 2 =.3962047137 r=. 9466303933 b) Calculate the Least-Squares Regression Line (LSRL) using your calculator. Report your equation. Be sure to identify your variables. LSRL is: y = 0.097 + 0.0055x 269 www.ckl2.org x = Verbal SAT Score y = predicted GPA c) Calculate the correlation coefficient (r). Report it here. What are the two things that this number tells us about this graph? The correlation is r = +0.9467. This tells us that the relationship is positive and strong. d) Identify and interpret the slope in the context of this problem. The slope is 0.0055. This tells us that for each increase of 1 point on the Verbal SAT Score, there will be an increase of 0.0055 in a student's GPA. e) Using your equation, what is the predicted GPA of a student who has a Verbal SAT Score of 500? Of a student with a score of 600? y = 0.097 + 0.0055(500) = 2.847 y = 0.097 + 0.0055(600) = 3.397 So, the predicted GPA for a student who scores 500 on the SAT Verbal, is approximately 2.8. And, the predicted GPA for a student who scores 600 on the SAT Verbal, is approximately 3-4- Outliers and Influential points An outlier is an extreme observation that does not fit the general pattern of the data (see the example below). Because an outlier is an extreme observation, the inclusion of it may affect the correlation, and the equation for the least-squares regression line. When examining a scatterplot and calculating the regression equation, it is worth considering whether extreme observations should be included or not. 10000 8000 6000 4000 2000 V 20 40 60 90 100 Let's use our GPA example to illustrate the effect of a single outlier. Suppose that we have a student who has scored very high on the SAT Verbal exam, but has a lower GPA. We will change Corbin's results to be 715 on the SAT and a GPA = 2.2, and see what happens to the LSRL and correlation. www.ckl2.org 270 Student Verbal SAT Score GPA Anna 595 3.4 Bryce 520 3.2 Corbin 715 2.2 Delia 405 2.3 Emilio 680 3.9 Frankie 490 2.5 Geraldine 565 3.5 Here are the LSRL equation and the correlation coefficient recalculated with Corbin's GPA changed: ■ D D D < a. ■ D D D D ■ Verbal SAT LinReg y=a+bx a= 1.395643281 b=. 091 9472285 r* = . 1093117698 r=. 3167203337 As you can see, this one change turned Corbin into an outlier. This caused the correlation to drop from r = 0.947, all the way down to r = 0.317. This is a huge change- it makes the relationship between the two variables extremely weak (rather that very strong). Also, this changed both the slope and the y-intercept of the LSRL equation dramatically. This means that predictions based on this LSRL will have different results than those based on the LSRL with Corbin's old GPA. There is no set rule when trying to decide how to deal with outliers in regression analysis, but you can now see how an outlier really can change everything when it comes to scatterplots, correlation and least-squares regression. Be sure to mention any potential outliers that you observe in any scatterplot. 271 www.ckl2.org Section 6.3 Exercises 1) Malia turned the water on in her bathtub full blast. She then measured the depth of the water every two minutes until the bathtub was full. Her findings are listed in the following table. In an section 6.1 we constructed a scatterplot and described the plot, we are now going to analyze this data further. Time (minutes] Depth (em) 2 7 4 9.5 6 14 8 19.5 10 21 12 24 14 32 16 36 IS 37.5 20 4-1 22 46 a) Construct a scatterplot on your graphing calculator (or computer). Sketch the graph that the calculator shows. Be sure to label your axes. b) Calculate the Least-Squares Regression Line (LSRL) using your calculator. Report your equation. Be sure to identify your variables. c) Calculate the correlation coefficient (r). Report it here. What are the two things that this number tells us about this graph? d) Identify and interpret the slope in the context of the problem. e) Using your equation, what is the predicted depth of the water after 17 minutes? After one hour? f) Are your answers in (e) reasonable? Why or why not? 2) The following table shows the progression of the Federal Minimum Wage in the United States since 1938 (source:http://www. laborlawcenter.com). We are going to analyze the relationship between year and minimum wage to see if there is a predictable relationship between the variables. www.ckl2.org 272 FEDERAL MINIMUM WAGE HISTORY Effective Date 10/24/1938 10/24/1939 10/24/1945 01/25/1950 03/01/1956 09/03/1961 09/03/1963 02/01/1967 02/01/1968 05/01/1974 01/01/1975 01/01/1976 01/01/1978 01/01/1979 01/01/1980 01/01/1981 04/01/1990 04/01/1991 10/01/1996 09/01/1997 07/24/2007 07/24/2008 07/24/2009 hourly Wage $0.25 $0.30 $0.40 $0.75 $1.00 $1.15 $1.25 $1.40 $1.60 $2.00 $2.10 $2.30 $2.65 $2.90 $3.10 $3.35 $3.80 $4.25 $4.75 $5.15 $5.85 $6.55 $7.25 a) Using year only as the explanatory variable (ignore month & day), construct a scatterplot. Sketch the graph that the calculator shows. Be sure to label your axes. b) Describe the relationship between the two variables. (S.C.O.F.D.) c) Calculate the Least-Squares Regression Line (LSRL). Add the line to your graph and report your equation. Be sure to identify your variables. d) Calculate the correlation (r). Even though r is very high, do you feel that a line is the best model for this data? Why or why not? e) Based on your model, what would you predict the Federal Minimum Wage to be in 2012? Is this an accurate prediction? Why or why not? 273 www.ckl2.org f) Based on your model, what would you predict the minimum wage to have been in 1968? How close is this to the actual minimum wage that year? 3) Suppose that some researchers analyzed the relationship between fathers' and sons' IQ scores for a group of men. Suppose further that they discovered that the relationship was reasonably linear and they calculated a regression line of y = 12 + 0.9x ; where x= father's IQ and y = son's IQ. a) Identify the explanatory and response variables. b) Identify and interpret the slope in the context of the problem. c) Identify and interpret the y-intercept in the context of the problem. d) Do your answers to (b) and (c) seem reasonable? Why or why not? e) What would you predict a son's IQ to be if his father has an IQ of 120? What if the father had an IQ of 140? f) If you knew that the original data included fathers with IQs from 108 to 145, explain why it would be inappropriate to use your model to predict a son's IQ if his father's IQ were 170. www.ckl2.org 274 6.4 More Least-Squares Regression Learning Objectives • Construct scatterplots using technology • Calculate and graph the least-squares regression line using technology • Calculate the correlation coefficient using technology • Use the LSRL to make predictions • Understand interpolation and extrapolation • Interpret the slope and the y-intercept of the LSRL Multimedia Links For an introduction to what a least squares regression line represents, see bionicturtled.com, Introduction to Linear R (5:15). http://www.youtube.com/watch?v=ocGEhiLwDVc For an applet that will calculate correlation and the least squares regression line, see http: //illuminations. nctm.org/lessonDetail. aspx?ID=L456 275 www.ckl2.org Section 6.4 Exercises 1) Mr. Exercise wanted to know whether or not customers continued to use their equipment after they purchased it. He contacted an SRS of his customers who had purchased an exercise machine during the past 18 months. His findings are summarized in the following table. We began to look at his data in section 6.1. We are now going to analyze it further. # months owned machine I 1 & 5 4.5 7 3 4 6 9 2 14 1.5 5 7 11 4 3 6.5 6 4 a) Construct a scatterplot. Calculate the LSRL and add it to your graph. Sketch your graph and report your equation. Be sure to identify your variables. b) Identify and interpret the slope in the context of the problem. c) Identify and interpret the y-intercept in the context of the problem. d) What is the correlation coefficient? What are the two things that this statistic tells about the relationship between these two variables? e) Based on your model, how many hours would you predict a person who has owned the machine for 12 months to exercise? 5 months? f) Based on your model, if a person claims to exercise 9 hours per week, how long would you suspect that they had owned the machine? 2) A college professor was becoming annoyed by how many of his students were absent during his 8:00 a.m. section of Philosophy 103. He decided to analyze whether these absences were affecting students www.ckl2.org 276 learning the material or not. He assigned his TA the task of keeping track of attendance. At the end of the semester he compared students' grade on the final exam with the number of times they had been absent. His findings are displayed in the following graph. Absenses & Final Exam Scores 90 - ♦ 4- y = 91.704 -l.hSAx E 30 - R 2 =0.8732 LLI 70 - "ra BO - ♦ c L_ c o 50 - ■w - ♦ 7"^-^ ♦ '11 30 - * ^\. <fl 1_ 20 - 10 - c 1 1 i i i 10 20 30 40 Number of Absense (87 total days] sc a) Identify the explanatory and response variables. b) Describe the relationship between these two variables (S.C.O.F.D). c) Jeremy was absent 25 times. What would you predict his score on the final exam to be? Lucy overslept and missed 43 classes. What would you predict for her score on the final? d) Calculate the correlation coefficient (r). What two things does this statistic tell you about the association between these two variables? (Hint: you were given R 2 ) e) Interpret the meaning of -1.654 in the context of this problem. 3) The following table shows the grade level and reading level for 5 students. Treat grade level as the explanatory variable as you do the following. 277 www.ckl2.org Grade vs. Reading Student j Grade Reading Level 1 A ! 2 j 7 | B 6 | 14" ! i_ ■ |C 5 | 12 ! 1 D 4 ; 9 | E 1 [ 4 a) Use your calculator to create a scatterplot. Then calculate the LSRL and the correlation coefficient for this data. Report your findings. What if it was found that student E was actually in grade 8? How would this affect the LSRL and/or the correlation? Grade vs. Reading Student Grade Reading Level j A 2 7 | B 6 14 1 | C 5 ! 1 12 | ! b 4 ! 1 9 ! ! ' E j 3 4j b) Use your calculator to create a new scatterplot. Then calculate the LSRL and the correlation coefficient for the changed data. Report your findings. c) What changes do you notice between your answers to (a) and (b)? Explain why these changes occurred. 4) The table below shows the nutritional information for Taco Bell Burritos as reported on the website: http://www.tacobeil.com. Choose two of the variables to analyze (avoid using trans fat & sugars). www.ckl2.org 278 item 3 1 f I s I 1 1 Bpfl 3 S 1 } •- < i 1 m ] 1 " A 1 f 5 u a 5 3 I E 3 0) & e ■o a* o ! Burritos 1/2 lb.* Cheesy Potato Burrito 248 540 230 7 26 0.5 45 1360 59 1 1/2 lb.* Combo Burrito 241 460 160 7 18 0.5 A ■5 1330 53 c 7-Layer Burrito 283 500 160 6 18 2 1090 69 1 Bean Burrito 198 370 90 3.5 10 5 980 56 1 Beefy 5-Layer Burrito 245 540 190 8 22 35 1280 68 c Beefy Nacho Burrito 186 470 180 6 20 30 990 58 A Burrito Supreme® - Chicken 248 400 110 5 12 40 1060 51 1 Burrito Supreme® - Steak 248 390 110 5 13 30 1100 51 1 Burrito Supreme®- Beef 248 420 140 6 16 35 ^^^n 53 c Chili Cheese Burrito 156 380 150 8 17 0.5 35 930 41 s Fresco Bean Burrito 213 350 70 2.5 8 990 57 1 Grilled Chicken Burrito 177 430 170 5 18 35 870 48 Z XXL Grilled Stuft Burrito - Beef 445 880 370 14 42 1 75 2050 95 1 XXL Grilled Stuft Burrito - Chicken 445 840 310 11 35 85 1970 92 1 XXL Grilled Stuft Burrito - Steak 445 820 320 12 36 ^H 70 2050 92 1 a) What will you be using as your explanatory and response variables? b) Construct a scatterplot. Label your axes. c) Describe the association (S.C.O.F.D.). d) Calculate the LSRL and the correlation. Report them. Be sure to define your variables. Add the line to your graph in part (b). e) Use your model to make a prediction that involves interpolation. f) Use your model to make a prediction that involves extrapolation. 5) Interpret the calculator output. The lifeguard at the Swimtastic Pool & Water-Slides decided to keep track of how many people came to the pool each day and compare this to the predicted high temperature for that day. The temperatures ranged from 82° to 96° during his data collection time period. He used the number of people as the response variable. Use this scatterplot and regression output from a TI-84 plus to answer the questions that follow. 279 www.ckl2.org D a ^^ ^^^ a a LinReg y=a+bx a= "2714. 353877 b=35. 07314871 r*=. 6120411792 r=. 7823306073 a) Write the regression equation. Define your variables. b) Identify and interpret the slope in the context of the problem. c) What are the two things that the correlation tells us in this situation? d) Based on this model, how many people would you predict on a 91° day? How about a 45° day? Are both of these predictions reasonable? Why or why not? e) What does r tell us in this situation? www.ckl2.org 280 6.5 Chapter 6 Review In this chapter, we have learned that when working with bivariate, numerical data it is important to first identify whether there is an explanatory and response relationship between the two variables. Often one of the variables, the explanatory (independent) variable, can be identified as having an impact on the value of the other variable, the response (dependent) variable. The explanatory variable should be placed on the horizontal axis, and the response variable should be placed on the vertical axis. Next we learned how to construct a visual representation, in the form of a scatterplot, so that we can see what the association looks like. A scatterplot helps us see what, if any, association there is between the two variables. If there is an association between the two variables, it can be identified as being strong if the points form a very distinct form or pattern, or weak if the points appear more randomly scattered. If the values of the response variable generally increase as the values of the explanatory variable also increase, the data has a positive association. If the response variable generally decreases as the explanatory variable increases, the data has a negative association. We also are able to see the form of the pattern, if any, in the graph. When the data looks reasonably linear, we learned how to use technology to calculate the least-squares regression line and the correlation coefficient. The least-squares regression line is often useful for making predictions for linear data. However, we now know to beware of extrapolating beyond the range of our actual data. Correlation is a measure of the linear relationship between two variables - it does not necessarily state that one variable is caused by another. For example, a third variable or a combination of other things may be causing the two correlated variables to relate as they do. We learned how to interpret the linear correlation coefficient and that it can be greatly affected by outliers and influential points. Also, just because two variables have a high correlation, does not mean that they have a cause-and-effect relationship. Correlation + Causation! Beyond constructing graphs and calculating statistics, we learned how to describe the relationship between the two variables in context. The acronym we learned to help us remember what to include in our descriptions is S.C.O.F.D. This tells us to describe the strength of the association, to be sure that our description is in context, to mention any outliers or influential points that we observe, and to describe the form and the direction of the relationship. We also learned how to interpret the slope and y-intercept of the least-squares regression line in context. Even though we are doing easy calculations, statistics is never about meaningless arithmetic and we should always be thinking about what a particular statistical measure means in the real context of the data. Chapter 6 Review Exercises Answer the following as TRUE or FALSE. 1) A negative relationship between two variables means that for the most part, as the x variable increases, the y variable increases. 2) A correlation of -1 implies a perfect linear relationship between the variables. 3) The equation of the regression line used in statistics is y = a + bx 4) When the correlation is high, one can assume that x causes y. Complete the following statements with the best answer. 5) The symbol for the Correlation coefficient is 6) A statistical graph of two variables is called a(n) . 7) The variable is plotted along the x-axis. 8) The range of r is from to . 281 www.ckl2.org 9) The sign of r and 10) LSRL stands for will always be the same. 11) If all the points fall on a straight line, the value of r will be or 12) If r = -0.86, then r 2 = . 13) If r 2 = 0.77, then r = or . 14) Using an LSRL to make predictions outside the range of our original data is called 15) Using an LSRL to make predictions within the range of our original data is called 16) When describing the relationship visible in a scatterplot, the acronym S.C.O.F.D. stands for 17) Suppose that a scatterplot shows a strong, linear, positive relationship, and the correlation coefficient is very high. However, both of the variables are actually increasing due to some outside lurking variable. This relationship suffers from . 18) Suggest possible lurking variables to explain the high correlations between the following variables. Consider whether common response, confounding, or coincidence may be involved. a) The number of cell phones being made has been increasing over the past 15 years. So has the number of starving children. Do cell phones cause starvation? b) The stress level of all of the employees at a certain company has been going up consistently over the past year. During this time, they have received three pay bumps. Does this mean that higher pay is causing the stress? c) Suppose that a study shows that the number of hours of sleep a person gets is negatively correlated the number of cigarettes a person smokes. Does this mean that not sleeping causes a person to smoke more cigarettes? 19) Some researchers wanted to determine how well the number of beers consumed can predict what a person's blood alcohol content will be after a given length of time. They set up an experiment in which several volunteers each drank a randomly selected number of beers during a given time period. The volunteers were between 21 and 25 years of age, but all ranged in gender and in weight. Exactly three hours after they began to drink the beers, their BAC level was measured three times. The three measurements were averaged and the results are given in the following table. (This is fictitious data, but is based on calculations from the BAC calculator at: http://www.dot.wisconsin.gov) Number of Beers Consumed (3 hours) 10 2 4 6 8 3 3 7 8 5 9 4 6 2 BAC Level 0.29 0.034 0.094 0.1 0.135 0.025 0.062 0.23 0.225 0.127 0.137 0.13 0.06 0.01 a) Identify the explanatory and response variables and construct a scatter-plot (be neat & label your axes). www.ckl2.org 282 b) Calculate the LSRL and correlation. Report the equation and add it to your scatter-plot? Identify your variables (report what x and y stand for). c) Identify and interpret the slope in context. d) Identify and interpret the y-intercept in context. e) If a person drinks 6 beers during this time period, on average what do you predict the person's BAC will be? f) If a person drinks 15 beers during this time period, on average what do you predict the person's BAC will be? g) Are you confident in both of the previous answers? Why or why not? 20) When investigating car crashes, it is often necessary to try to determine the speed at which a vehicle was traveling at the time of the accident. Investigators are able to do this by measuring the length of the skid mark left by the vehicle in question. The following table lists several speeds (mph) based on the skid length (feet), according to the Forensic Dynamics website: http://forensicdynamics.com. SPEED BASEDON SKID LENGTH Skid Length [feet) Estimated Speed (mph) 45 30.68 20 20.45 56 34.23 S 12.93 73 40.4 93 44.11 165 58.75 115 49.05 142 54.51 134 62.05 215 67.07 247 71.89 283 www.ckl2.org a) Identify the explanatory and response variables and construct a scatter-plot (be neat & label your axes). b) Calculate the LSRL and add it to your scatter-plot? Report the equation and identify your variables. c) Describe the relationship you see in the scatter-plot (S.C.O.F.D.). Be thorough & use complete sentences! Be sure that you explain the relationship in the context of the problem (overall trend between the two variables). d) What is the correlation coefficient? Based on your scatterplot and the value of r, how well do you feel that your model fits this data? Explain e) What is the predicted speed if the skid mark is 157 feet? If it were 36 feet? f) Would you expect predictions beyond 250 feet to generally over-estimate or under-estimate the actual speed of the vehicle? Why? www.ckl2.org 284 Chapter 7 The Normal Distribution 7.1 Introduction to the Normal Curve ,M0* Learning Objectives Understand how a density curve can be used to approximate the data in a histogram Understand how to visually identify the mean and standard deviation of a normal distribution Be able to tie the concepts of percentages in the 68-95-99.7 empirical rule to normal distributions In previous chapters we have seen how data can be represented by histograms. A density curve is a curve that gives an approximate description of a distribution. The curve is smooth, so any small irregularities in the data are ignored. A density curve for a particular histogram is shown below. Perhaps the most important thought to remember about a density curve is that it represents 100% of the data. In other words, the area under any density curve is equal to 1. This is important because it allows us to ask probability questions about a population. For example, we might ask how likely is it that a teenager has a shoe size of 8 or larger. 285 www.ckl2.org 6 10 In our chapter, we will focus on a special density curve called the normal curve. Have you ever wondered if you are 'normal'? You probably are normal in most ways, but there may be some things about you that might not be considered normal by the mathematical definition. If you are on the high school baseball team, do you throw the baseball at a 'normal' speed? Is your hair a 'normal' length? Do you drive at a 'normal' speed on the freeway? Our goal this chapter is to gain an understanding of what 'normal' really is and how to properly calculate within the Normal Distribution. We have seen skewed distributions before. The density curves in the following figure show one density curve that is skewed left and one that is skewed right. Skewed Left Skewed R^ght ***** m A normal curve is neither skewed left nor right and is often referred to as 'the bell curve' because of its shape. It is symmetrical. In addition, as you get closer and closer to the middle of the curve, there is a higher frequency of results. The mean (along with the median and mode) always lands at the center of a normal distribution. When dealing with the mean in previous chapters, we have used the symbol x because the data came from a sample. Normal distributions deal with an entire population instead of just a sample and we will use the symbol // (Greek letter mu) to mark the mean of a normal distribution for an entire population. The mean is one of two key values needed to make a proper sketch and analysis of a normal distribution. The curve shown below represents a normal distribution and is a good representation of what a normal curve looks like. www.ckl2.org 286 Mean Median Mode Normal Distribution Center Note that the amount of data to the right of the mean is the same as the amount of data to the left of the mean. Thinking about the definition of the median, this suggests that the mean and median are located at the same point. The other key component used to construct and analyze a normal distribution is the standard deviation. The standard deviation is a measure of spread and can be loosely thought of as a 'typical' distance from the mean. You may have calculated the standard deviation before for data sets either by hand or by using your calculator and looked for the S x in the statistical calculations summary screen. The symbol S x is used for the standard deviation whenever data is collected through the use of a sample from a population. When dealing with the normal distribution, we will use the symbol a (Greek letter sigma) to represent the standard deviation. The <r symbol indicates that the standard deviation of the entire population is known. Visually, the standard deviation can be seen as the distance from the mean to an inflection point. An inflection point is located on a curve at the point where the curve changes from concave up (bent up) to concave down (bent down) or vice versa. On the normal curve in Figure 7.1, the mean is 23 and the standard deviation is 3. Length of the Standard Deviation Inflection Point 23 26 Figure 7.1 287 www.ckl2.org The 68-95-99.7 Rule ***** It is now time to make use of the some of the special characteristics of the normal curve. As mentioned earlier, 100% of all results fall somewhere under the normal curve. It turns out that approximately 68% of all results are within one standard deviation of the mean, 95% of all results are within 2 standard deviations of the mean, and 99.7% of all results land within three standard deviations of the mean. These percentages are illustrated in the graphic below. ii-3.3" tt-2& tf-lcr tf+lff if 4 2cr .U4 3(J The numbers on the bottom represent the number of standard deviations from the mean. For example, the // — lcr marks the point one standard deviation below the mean. Some simple addition and subtraction allows us to be very specific in the percents of the data that land in the sections of the normal curve as shown below. MEAN 2.35% 34% 34% -3 -2 -1 Can you see the 68-95-99.7 rule here? 2.35% Example 1 Suppose the mathematics portion of the SAT exam is normally distributed with a mean of 500 and a standard deviation of 100. a) Sketch a normal curve for this situation marking the mean and the values 1,2, and 3 standard deviations above and below the mean. www.ckl2.org 288 b) Using the 68-95-99.7 rule, approximately what percent of students scored at least 600 on this test? c) Between approximately which two scores did the middle 95% of students score? d) Suppose that 4600 students take the exam this month. How many of those students should we expect to obtain a score of at least 700? Solution a) SAT Score Distribution _■ i i i i i i i i i i i i i i i i i i i_ 200 300 400 500 600 700 800 b) We know that 50% of all results are below the 500 marker and that 34% of all results land between 500 and 600. We have used up 50% + 34% = 84% of all results. This tells us that 100% - 84% = 16% of all students scored above 600 on the mathematics portion of the SAT. c) The middle 95% of all students scored within 2 standard deviations of the mean or between 300 and 700. d) A score of 700 marks the boundary two standard deviations above the mean such that only 2.5% of all test takers will score at least 700. 2.5% of 4600 is 115 students. Example 2 The normal curve below represents the number of races that a typical racehorse will run in one calendar year. 289 www.ckl2.org 11 13 a) Approximately what percent of racehorses will run between 5 and 11 races during a calendar year? b) What are the values of the mean and standard deviation for the distribution shown? Solution a) Add 13.5% + 34% + 34% to get 81.5% so 81.5% of racehorses run between 5 and 11 races per year. b) The mean racehorse will run 9 races per year with a standard deviation of 2 races. What is Normal? Let's now go back and try to think about our original question "What is normal?" In mathematics, the middle 95% is often (but not always) considered our 'normal' group. For example, suppose the ACT exam is normally distributed with a mean of 18 and a standard deviation of 6. Our 'normal' group would be comprised of those students who scored anywhere within two standard deviations of the mean or from 6 to 30 on the exam. A student who scored 31 or higher on the exam would have achieved an exceptional score. We might say that this student was not normal with regards to their ACT score. Normal distributions are not as common as you might think. What if we measured the lengths of shoes of teenagers? Many students think that this would be normal when in fact, there are a couple of contributing factors that might tip us off that the situation may not be normal. First of all, teenagers encompass a large population. Most of those who are in their upper teen years have finished growing into their adult shoe size length whereas many of the younger teens are still growing. This would tend to give us a slightly larger percentage of smaller shoe lengths than we might expect from a normal distribution. In addition, teenagers include males and females. This may lead to us seeing a situation which might be bi-modal. We might expect to see a peak at the most common male lengths and at the most common female lengths. Example 3 Which situation below is most likely to produce a normal distribution? www.ckl2.org 290 a) The heights of all adults. b) The wingspans of three year-old American eagles. c) The number of teeth that Americans adults have. Solution The correct answer is b). Three year-old American eagles have an average wingspan and we would expect that there are quite a few eagles at that wingspan or very close to it. As we move further and further up and down from that average, we would expect to see fewer and fewer eagles with those wingspans. Answer a) could be ruled out quickly in that the heights here do not specify a particular group. For example, this data would include males and females. Answer c) is out because the vast majority of American adults have 32 teeth. As we move away from 32, there are some people with fewer teeth due to a variety of reasons but there are virtually no people with more than 32 teeth. We should see symmetrical results if this was a normal distribution. http://www.anoka.kl2.mn. us/education/page/download. php?fileinfo=Ny0xX01udHJvX3RvX05vcmlhbF9DdXJ2ZV£ Problem Set 7.1 Exercises 1) Consider the histogram shown below. a) Make a sketch of the histogram and overlay a sketch of a density curve for the histogram. b) What is the area under your density curve? c) What is the shape of the density curve? 2) A roadside bait salesman digs up worms to sell to fishermen. It turns out that the worms have a mean length of // = 112 mm and a standard deviation of <x = 12 mm. a) Draw and label a normal curve for this distribution. Include lines for the mean and for 1, 2, and 3 standard deviations above and below the mean. b) What percentage of the worms will have lengths longer than 112 mm? 291 www.ckl2.org c) What percentage of the worms will have lengths between 100 and 124 mm long? d) What percentage of the worms will have lengths between 100 and 112 mm long? e) What percentage of the worms are longer than 124 mm? f) What percentage of the worms are shorter than 88 mm? 3) Sketch a normal curve which has a mean of 13 pounds and a standard deviation of 3 pounds. Include lines for the mean and for 1, 2, and 3 standard deviations above and below the mean. 4) Not all 12-ounce cans of soda are the same. It turns out that the average 12-ounce can of soda does contain twelve ounces of soda, but the amount of soda is normally distributed with a standard deviation of 0.15 ounces. Fill in the blanks for each statement below. a) The middle 68% of all 12-ounce soda cans contain between & ounces of soda. b) The middle 95% of all 12-ounce soda cans contain between k ounces of soda. c) The middle 99.7% of all 12-ounce soda cans contain between soda. k ounces of 5) Figure 7.2 shown below shows an approximate distribution of the number of fish caught by the com- petitors during a one hour pan-fishing contest. Give the approximate values of the mean and the standard deviation for the distribution. 6) Suppose the weights of adult males of a particular species of whale are distributed normally with a mean of 11,600 pounds and a standard deviation of 640 pounds. a) Draw a normal curve for this situation. Use vertical lines to mark and label the mean and 1,2, and 3 standard deviations above and below the mean. b) What percent of these whales weigh less than 10,320 pounds? c) Between what two weights do the middle 99.7% of these whales weigh? www.ckl2.org 292 Figure 7.2 d) What percent of these whales weigh between 10,320 pounds and 12,240 pounds? 7) Which situation is most likely to be normally distributed? Explain your reasoning. i) The hair lengths for all the Statistics and Probability students who have Mr. Johnson as a teacher. ii) The prices of all new Ipod Touches that are sold in Minnesota this week. iii) The average running times for all 4th grade boys at Andover Elementary in the 50 yard dash. 8) Suppose a standard incandescent light bulb will run an average of 400 hours before burning out. Of course, some bulbs burn out sooner and some last longer. Suppose that the average lives of these bulbs is normally distributed with a standard deviation of 35 hours. a) Sketch and label a normal curve to illustrate this situation. b) What percent of these bulbs will burn out in 400 hours or less? c) If you are lucky, your bulb will last longer than advertised. What percent of bulbs should last 435 hours or more? What percent of bulbs will last 470 hours or more? d) If you had 5000 bulbs that you needed for use in a large office building, how many would you expect to last at least 365 hours? 9) Suppose that the time that it takes for a popcorn kernel to pop produces a normal distribution with a mean of 145 seconds and a standard deviation of 13 seconds for a standard microwave oven. a) It is usually not a good idea to let the microwave oven run until all the kernels are popped because some of the popcorn will start to burn. Suppose the ideal time to shut off the microwave oven is after about 97.5% of the kernels have popped. When will 97.5% of the kernels be popped? b) Between what two times will we see the middle 68% of kernels popped? 293 www.ckl2.org 10) After a great deal of surveying, it is determined that the average wait times in the cafeteria line are normally distributed with a mean of 7 minutes and a standard deviation of 2 minutes. Suppose that 400 students are released to the cafeteria for 2nd lunch. a) Approximately how many students will have to wait more than 5 minutes for their food? b) Approximately how many students will have to wait more than 11 minutes for their food? 11) Sudoku is a popular logic game of number combinations. It originated in the late 1800s in the French press, Le Siecle. The mean time it takes the average 11th grader to complete the Sudoku puzzle below was found to be 19.2 minutes, with a standard deviation of 3.1 minutes. a) Draw a normal distribution curve to represent this data. b) Suppose Andover High School is going to put together a Sudoku team. The coach has decided that she will only consider players who score in the fastest 2.5% of the junior class as she puts together the team. How fast must a student solve a puzzle to be in the top 2.5% of puzzle solvers? c) If there are 400 kids in the Andover junior class, how many of them will be able to solve the Sudoku puzzle below in 16.1 minutes or less? 5 3 7 6 1 9 5 9 8 6 8 6 3 4 8 3 1 7 2 6 6 2 8 4 1 9 5 8 7 9 12) In order to qualify for undercover detective training, a police officer must take a stress tolerance test. Scores on this test are normally distributed with a mean of 60 and a standard deviation of 10. Only the top 16% of police officers score high enough on the test to qualify for the detective training. What is the cutoff score that marks the top 16% of all scores? www.ckl2.org 294 Review Exercises 13) Use your calculator to find the mean and standard deviation of the data set below. 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8 14) A pet store must select 2 dogs and 2 cats for display in their front window. In how many ways can this be done if there are 16 dogs and 12 cats available to choose from? 15) By hand, give the five number summary for the data set below. 3, 5, 5, 6, 8, 9, 10, 10, 12, 13, 13, 13, 14, 15, 17, 19, 19, 20 16) A student conducts a survey in which 100 tenth-graders are asked "What is your favorite item on the lunch menu at school today?" The student decides to conduct this survey by handing each tenth- grader a survey sheet while they are eating and asking them to fill it out and turn it in to room P202 by the end of the day. Why will this survey method have a problem with bias? 7.2 Z-Scores, Percentiles, and Normal CDF ***** Learning Objectives Be able to calculate and understand z-scores Understand the concept of a percentile and be able to calculate it for a particular result Be able to calculate percentages of data above, below, or in between any specific values in a normal distribution 295 www.ckl2.org Be able to use z-scores to compare results for two different but related situations In section 7.1, we analyzed normal distributions and specific situations in which analysis was done for data which followed the 68-95-99.7 rule exactly. The truth of the matter is that most situations require us to answer questions that do not reference exact whole numbers of standard deviations above or below the mean. What if we asked a student what their actual score would be if they were in the top 10% of ACT test takers? We need a tool to help us deal with these types of situations. Our first tool will be the z-score formula. The z-score is a measure of how many standard deviations above or below the mean a particular value is. If a z-score is negative, the result is below the mean and if it is positive, the result is above the mean. For example, if the ACT mathematics exam scores are normally distributed with a mean of 18 and a standard deviation of 6, then an ACT score of 30 would be equivalent to a z-score of 2 because 30 would be 2 standard deviations above the mean. The formula below gives a quick way to calculate z-scores. In the formula, 'x' is the observation, n is the mean of the distribution, and <x is the standard deviation for the distribution. Example 1 Suppose the mean length of the hair of 10th grade girls is 10 inches with a standard deviation of 4 inches. What would be the z-score for hair length for a 10th grade girl whose hair is 16 inches long and what does it mean in terms of the normal curve? Solution It is often a good idea to draw a sketch for these sorts of situations so we can visualize what is happening. www.ckl2.org 296 14 16 18 Because 16 is located between 1 and 2 standard deviations above the mean, we expect a z-score between 1 and 2. Use the formula z = -£- to calculate the z-score. Our observation, x, is 16 inches while the mean is fj. = 10 inches and the standard deviation is o~ = 4 inches, z = T or z = 1.5. This tells us that a hair length of 16 inches will be 1.5 standard deviations above the mean. Example 2 Suppose that the z-score for a particular 10th grade girl's hair length is z the girl's hair? -1.25. What is the length of Solution We will use the z-score formula to find our answer. -1.25 =*=P -5 = jc-10 5 = x The length of the hair for this girl would be 5 inches. Example 3 Suppose a student can either submit only their SAT score or their ACT score to a particular college. Suppose their SAT score was 620 and that the SAT has a mean of 500 and a standard deviation of 100. Suppose also that the same student scored a 25 of their ACT exam and that the ACT exam has a mean of 18 and a standard deviation of 6. Which score should the student submit? Solution Looking at the diagram below, it is not exactly clear which score is better. They appear to be quite similar and we will need to do some calculations to make a distinction. 297 www.ckl2.org SAT ACT 500 600 700 Calculate the z-score for each exam. For the SAT, z 620-500 1.2. For the ACT, z 25-18 100 " U^-ilV,^, «. g 1.17. Since the z-score is higher on the SAT, the student should submit the SAT exam score. Percentiles In order to understand how to apply z-scores beyond what we have already done, we must first understand percentiles. A percentile is a marker on a normal curve such that the marker is greater than or equal to that percentage of results. For example, suppose you are at the 30th percentile for how fast you type. This means that you can type faster than 30% of all people. The percentile can also be thought of as the percent of area to the left of its marker. The graphic below shows where the 30th percentile is located. The shaded area to the left of the marker represents 30% of the normal curve. Markerforthe 30 th Percentile It is very common for colleges and universities to use percentiles for entrance criteria. For example, a rather elite university might require that you score at the 90th percentile or higher on your ACT exam to be considered for admissions. Doctors often use percentiles to track the growth of babies. For example, can you picture what a baby would look like that is at the 70th percentile for weight and the 25th percentile for length? Now we must ask what percentiles have to do with z-scores. Find the Normal Distribution Table in Appendix A, Part 2 of your book. Let's examine the z-score of -1.25 from Example 2. Find the z-value of -1.2 and then go over until you are under the 0.05 column. A partial table is given in Figure 7.3 below and the value in the cell we are looking for is bold and underlined. The value of 0.1056 can be interpreted as a percentile. This means that the girl in Example 2 has hair that is longer than 10.56% of all girls. In other words, she is at about the 10th or 11th percentile for hair length for 10th grade girls. www.ckl2.org 298 Negat ivez-scores: Z 0.09 0.03 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.0 -1.3 0.0823 0.0838 0.0853 0.0869 0.OSS5 0.0901 0918 0.0934 0.0951 0.0968 -1.2 0.0985 0.1003 0.1020 0.1033 0.1056 0.1075 0.1093 0.1112 0.1131 0.1151 -1.1 0.1170 0.1190 0.1210 0.1230 0.1251 0.1271 0.1292 0.1314 0.1335 0.1357 Figure 7.3 Example 4 At what percentile for hair length is a 10th grade girl if her hair is 17 inches long? Solution Start by determining her z-score which would be z 17-10 1.75. We now go to the Normal Distribution Table in Appendix A, Part 2 of the book. We go across the row with z=1.7 until we are under 0.05. This gives a value of 0.9599. This tells us the girl is at about the 96th percentile for hair length. In other words, this girl's hair is longer than 96% of all 10th grade girls. 'Between' and 'Above' Problems While it is nice to find percentiles for certain situations, we are often asked for the percentage of results that are between two given parameters or above a given parameter. For example, we might be asked to find the percentage of all 10th grade girls that have hair lengths between 8 inches and 15 inches long. To find these types of results, we often must do multiple z-score calculations and some addition or subtraction. Example 5 Suppose the weights of adult males of a particular species of whale are distributed normally with a mean of 11,600 pounds and a standard deviation of 640 pounds. a) What percent of these adult male whales will weigh between 11,000 and 12,000 pounds? b) What percent of these adult male whales will weigh more than 12,000 pounds? Solution a) Begin by finding the z-scores for both of the weights given and get z 11,000-11,600 -0.9375 and z 12,000-11,600 640 fjg = 0.625. For z= -600 640 -0.9375, our Normal Distribution Table in Appendix A, Part 2 gives us a value between 0.1736 and 0.1762. Since -0.9375 is closer to -0.94 than -0.93, we will use a value of 0.174. We get a value between 0.7324 and 0.7357 for z=0.625. We will split the difference on this and use 0.734. All that is left to do now is subtract 0.734 and 0.174 to get 0.56 or about 56% of all adult male whales of this species are between 11,000 299 www.ckl2.org and 12,000 pounds. The shaded region in the Figure 7.4 below represents about 56% of the normal curve. 11,000 12,000 Figure 7.4 b) Use z=0.625 from part a) to get a value from the table of 0.734. This means that 73.4% of all whales weigh 12,000 pounds or less. Therefore, 100%-73.4%=26.6% of all whales weigh more than 12,000 pounds. Technology It is also important to note that graphing calculators can be used to quickly solve the types of problems dis- cussed in this section by using the NormalCdf command. Typically, this command requires that four values be entered, the lower bound, the upper bound, the mean, and the standard deviation. In Example 5, we can solve the problem in part a) simply by typing in the command string NormalCdf(11000, 12000, 11600, 640) and obtain the immediate result of 0.5598 or 56%. Be sure you know how to access this command if you have a graphing calculator. Appendix C has some notes for users of graphing calculators. An online calculator that is very similar to a graphing calculator and gives us the same information can be found at http://wolframalpha.com . You might also be wondering how to solve a problem using the NormalCdf command if only one parameter is given. Let's revisit Example 4 to see how this works. Example 6 At what percentile for hair length is a 10th grade girl if her hair is 17 inches long? Solution There is only one boundary given in this problem. It is your job to come up with a second boundary. In this case, the percentile we want to calculate is found by finding the percentage of all girls whose hair is 17 inches or less. We will use a lower bound of -100 and an upper bound of 17. We use -100 simply because we are confident that we will not find any results any further left than this. Typically, choose your missing parameter as being so extreme that it will not be even in the realm of possible results. NormalCdf(-100,17,10,4)=0.9599 so the length of the girl's hair is at about the 96th percentile. www.ckl2.org 300 Problem Set 7.2 Exercises For problems 1) through 14) use the following information: On a particular stretch of road, the number of cars per hour produces a normal distribution with a mean of 125 cars per hour and a standard deviation of 40 cars per hour. 1) Sketch a normal curve for this situation. Be sure to label and mark the mean and 1 and 2 standard deviations above and below the mean. 2) What is the z-score for an observation of 165 cars in one hour? 3) What is the z-score for an observation of 85 cars in one hour? 4) Calculate the z-score associated with an observation of 171 cars in one hour. 5) Suppose 135 cars are observed in one hour. At what percentile would this observation occur? 6) Suppose 70 cars are observed in one hour. At what percentile would this observation occur? 7) At what percentile would an observation of 125 cars occur? 8) What is the probability of observing at least 145 cars on the road in a an hour? 9) What is the probability of observing between 100 and 150 cars on the road in an hour? 10) Determine the percentile for an observation of 140 cars on the road in one hour. 11) Determine the percentile for an observation of 65 cars on the road in one hour. 12) Determine the probability of observing between 90 and 130 cars on the road in one hour. 13) Determine the probability of observing at least 160 cars on the road in one hour. 14) Determine the probability of obsering no more than 110 cars on the road in one hour. For problems 15) through 20) use the following information: The number of ants found in one mature colony of leafcutter ants is normally distributed with a mean of 136 ants and a standard deviation of 14 ants. 301 www.ckl2.org 15) One ant colony has 165 ants. At what percentile for size is this ant colony? 16) An ant colony has a z-score of -1.35 for size. How many ants would we expect to find in this colony? 17) Another ant colony has 131 ants. What is the z-score for this ant colony? 18) What is the probability of finding an ant colony with 160 ants or less? 19) What is the probability of finding an ant colony with 150 ants or more? 20) What is the probability of finding an ant colony that has between 120 and 155 ants in it? 21) Twin brothers Ricky and Robbie each took a college entrance exam. Ricky took the SAT which had a mean of 1000 with a Standard Deviation of 200 while Robbie took the ACT which had a mean of 18 with a standard deviation of 6. Which brother did better if Ricky scored a 1140 and Robbie scored a 22? 22) Suppose the average height of an adult American male is 69.5 inches with a standard deviation of 2.5 inches and the average height of an adult American female is 64.5 inches with a standard deviation of 2.3 inches. Who would be considered taller when compared to their gender, an adult American male who is 74 inches tall or an adult American female who is 68.5 inches tall? Explain your answer. 23) Professional golfer John Daly is one of the longest hitting golfers in history. Suppose his drives average 315 yards with a standard deviation of 12 yards. Will a drive of 345 yards be in his top 1% of his longest drives? Explain your answer. Review Exercises 24) What is the area under any density curve equal to? 25) In a standard deck of 52 cards, what is the probability of being dealt two queens if you are dealt two cards from the deck without replacement? www.ckl2.org 302 26) In a class competition, each grade (9-12) enters 10 students to run in a 500 meter race. Boys times for 9th graders and 12th graders are given below in seconds. Build a back-to-back stem plot to compare data for the two groups of students. 9th Grade Times = 115, 118, 118, 121, 126, 127, 131, 134, 140 12th Grade Times = 106, 106, 109, 112, 114, 116, 116, 121, 122, 133 27) It turns out that countries that have higher percentages of people with computers also tend to have people who live longer. Is it logical to assume that shipping many computers to countries whose people have lower life-expectancies will help the people in those countries live longer? Answer the question including justification that references either Cause and Effect, Common Response, Confounding, or Coincidence. 28) A sample survey at a local college campus asked 250 students how many textbooks they were currently carrying. The table below shows a summary of the findings. Use the table to determine the expected number of textbooks that an average college student at this campus would be carrying. Table 7.1: Textbooks Carried by Students # of Books 12 3 Probability 0.21 0.37 0.32 0.1 7.3 Inverse Normal Calculations a** Learning Objectives • Understand how to use the Normal Distribution Table and the z-score formula to find values for a particular normal distribution given a percentile 303 www.ckl2.org • Be able to use the Inverse Normal command on a graphing calculator to find values for a particular normal distribution given a percentile • Be able to find values for a particular normal distribution given a 'middle' percentage range We can now comfortably calculate percentages, percentiles, and probabilities given key information about a normal distribution. It is possible to go the other direction. In other words, if you are told a certain result is at a specific percentile, you can figure out what the actual value is equal to that is at that percentile. The process can be done using the Normal Distribution Table in Appendix A, Part 2. Begin by identifying the percentile you are interested in and finding it in the table. From there, put the value from the table into the z-score formula and solve it for the observation in question. Example 1 Suppose that 10th grade girls have hair lengths that are normally distributed with a mean of 10 inches and a standard deviation of 4 inches. How long would a 10th grade girl's hair have to be in order to be at the 80th percentile for length? Solution The figure below shows the distribution of hair lengths and also marks where the 80th percentile is located. 80 th Percentile Marker 10 14 18 Begin by finding the value closest to .8000 in the Normal Distribution Table. We find our closest value to be .7995 which corresponds to a z-score of 0.84. Put this value into the z-score formula to get 0.84 : -10 jr-10 4 0.84 3.36 = x- 10 13.36 = x A 10th grade girl would have to have a hair length of about 13.4 inches to be at the 80th percentile. This looks to be right based upon comparison to the figure above. www.ckl2.org 304 Technology Once again, it is important to note that technology can be used to solve these types of problems without having to reference the Normal Distribution Table. The command that is commonly used for these types of problems is the Inverse Normal command or InvNorm. The Inverse Normal command requires users to enter the percentile in question, the mean, and the standard deviation. To solve the problem in Example 1, we could have typed in InvNorm(0.80,10,4) and we would have immediately had an answer of 13.366 or about 13.4 inches of hair. Be sure you know how to access this command if you have a graphing calculator. Appendix C has some notes for users of graphing calculators. An online calculator that can produce the same information can be found at http://wolframalpha.com . 'Middle' and 'Top' Problems Finally, we are sometimes in situations where we want to know what range of results are found in a middle percentage interval or what value one would have to be at in order to be in a specific top percentage. For example, a car salesman might wish to know what sales prices comprise the middle 50% of his sales to help him learn more about who his customers tend to be or a student might wish to know what they need to score on a test in order to be in the top 10%. Once again, this process can be done with either the Normal Distributions Table or by using technology. Example 2 Professional golfer John Daly is known for his long drives off the tee. Suppose his drives have a mean distance of 315 yards with a standard deviation of 12 yards. What lengths of drives will constitute the middle 60% of all of his drives? Solution The sketch below is helpful in understanding what is happening here. 315 305 www.ckl2.org It is easy to calculate that marker line 'a' is at the 20th percentile and marker line 'b' is at the 80th percentile simply by noting their relationship to the 50th percentile marker. In addition, note that 'a' and 'b' clearly enclose the middle 60 percent of all data. From the Normal Distributions Table, we can see that the z-score associated with the 20th percentile is -0.84 and the z-score associated with the 80th percentile is 0.84. We now calculate -0.84 = x ~^ 5 or x = 303.2 yards. A similar calculation at the 80th percentile gives us x = 326.8 yards. In other words, the middle 60% of John Daly's drives will travel between 303.2 yards and 326.8 yards. We also could have used the Inverse Normal command once we knew the percentiles. In- vNorm(. 20,335, 14) = 323.2 yards and InvNorm(. 80,335, 14) = 346.8 yards. Example 3 In a weightlifting competition, the amount that the competitors can lift is normally distributed with a mean of 196 kg and a standard deviation of 11 kg. Only the top 20% of all competitors will be able to advance to the next phase of the competition. What amount must a competitor lift in order to move into the next phase of the competition? Solution The key to this problem is noticing that to be in the top 20%, a competitor would actually have to be at the 80th percentile. The z-score at the 80th percentile is z=0.84. 0.84 = ^6 9.24 = x- 196 205.24 = x The competitor would have to lift about 205 or 206 kg. Using a calculator, we get In- vNorm(.8,196,ll) = 205.26 kg. Problem Set 7.3 Exercises 1) The Standard Normal Curve is defined as having a mean of and a standard deviation of 1. a) What is the z-score associated with a result at the 84th percentile? b) What is the z-score associated with a result at the 16th percentile? c) Find a z-score such that only 5% of the Standard Normal Curve is to the right of that z-score. d) Find a z-score such that only 35% of the Standard Normal Curve is the left of that z-score. e) Find the two z-scores such that the middle 50% of the Standard Normal Curve is between the two z-scores. www.ckl2.org 306 2) Doctors often monitor their patients blood-glucose levels. It is known that for blood-glucose levels, yu 85 and a = 25. American Diabetes Association ® a) Draw and label sketch of the normal distribution for this situation marking the mean and 1,2, and 3 standard deviations above and below the mean. b) It turns out that doctors consider the blood-glucose level of a patient to be normal if the level is in the middle 94% of all results. What range of blood-glucose levels constitute the middle 94% of all results? c) Patients are considered to be high risk for diabetes if their blood-glucose test comes back in the top 1% of all results. What blood-glucose level marks the start of the top 1% of blood- glucose levels? d) Doctors also show concern if there is too little blood- glucose in a patient's system. They will prescribe treatments to patients if their blood-glucose is in the lowest 2% of all patients. What is the blood-glucose level that marks this boundary? 3) For a given population of high school juniors and seniors, the SAT math scores are normally distributed with a mean of 500 and a standard deviation of 100. For that same population, the ACT math exam has a mean of 18 with a standard deviation of 6. a) One school requires that students score in the top 10% on their SAT math exam for admission. What is the minimum score that a student must achieve to be considered for this school? b) Another school requires that students score in the top 40% on their ACT math exam for admission. What is the minimum score that a student must achieve to be considered for this school? c) One particular school likes to focus on mid-level students and so they only accept students who are in the middle 50% of all ACT math test takers. Between what two scores must a student achieve in order to be considered for acceptance into this school? d) One student boasts that they scored at the 85th percentile on their ACT math exam. Another student brags that they scored a 620 on the SAT math exam. Who did better? 4) Many athletes train to try to be selected for the US Olympic team. Suppose for the men's 100 meters, the athletes being considered for the team have a mean time of 10.06 seconds with a standard deviation of 0.07 seconds. In the final qualifying event for the team, only the top 20% of runners will be selected. What time must a runner get to be in the top 20%? 307 www.ckl2.org 5) A high school basketball coach notices that taller players tend to have more success on his team. As a result, the coach decides that only the tallest 25% of the boys in the 11th and 12th grades will be allowed to try out for the team this year. Suppose that the mean height of 11th and 12th grade boys is 5 feet 9 inches with a standard deviation of 2.5 inches. How tall must a player be in order to be able to try out for the team? 6) A student comes home to his parents and excitedly claims that he is in the top 90% of his class. Explain why this might not be worth getting excited about. 7) At a certain fast-food restaurant, automatic soft drink filling machines have been installed. For 20- ounce cups, the machine is set to fill up the cups with 19 ounces of soda. Unfortunately, the machine is not perfectly consistent and does not always dispense 19 ounces of soda. Suppose the amount it dispenses produces a normal distribution with a mean of 19 ounces and a standard deviation of 0.6 ounces. It turns out that the 20 ounce cup will actually hold a bit more than 20 ounces. A mathematically inclined worker notices this and starts to record what happens when the machine fills the cups. It turns out that the cups overfill 2% of the time. How much soda will the 20-ounce cup actually hold? Review Exercises 8) Adult male American bald eagles have a mean wingspan of 79 inches with a standard deviation of 3.5 inches. What percent of these eagles have wingspans longer than 7 feet? 9) Consider the data in the table below where 'x' represents the number of alcoholic drinks consumed and www.ckl2.org 308 'y' represents the blood-alcohol concentration. X 3 1 4 5 3 4 V 0.08 0.02 0.07 0.11 0.05 0.09 X 2 7 8 5 4 8 V 0.06 0.13 0.15 0.07 0.10 0.14 a) Make a labeled and scaled scatterplot for the data set. b) Determine the correlation coefficient, r, for the scatterplot. c) Give the least-squares linear regression equation for the scatterplot. d) Using your answer from part c), predict the blood-alcohol concentration for a person who has had 6 drinks. e) Using your answer from part c), predict how many drinks someone has had if their blood- alcohol concentration is 0.12. 10) Consider a standard set of 15 pool balls. Pool balls #l-#8 are solid and pool balls #9-#15 are striped. a) If you randomly select one pool ball, what is the probability that it is both solid and odd? b) If you randomly select one pool ball, what is the probability that it is either solid or odd? c) If you randomly select two pool balls without replacement, what is the probability that they are either both solid or both striped? 7.4 Chapter 7 Review In this chapter we have discussed what a density curve is and specifically focused on a special density curve called the normal distribution. The two critical elements that are necessary for analysis of a density curve are the mean and standard deviation. The mean is the center of the distribution while standard deviation is a measure of spread. We have focused on several key concepts including the 68-95-99.7 rule and z-scores. We then introduced the Normal Distribution Table and the NormalCdf and InvNorm commands to help us be able to move back and forth between probabilities and percentiles and specific values in our distributions. Chapter 7 Review Exercises 1) Suppose a teacher gives a test in which the scores on the test are normally distributed with a mean of 10 points and a standard deviation of 2 points. 309 www.ckl2.org a) Draw a normal curve to represent this situation. Clearly mark the mean and 1, 2, and 3 standard deviations above and below the mean. b) Using the 68-95-99.7 rule, approximately what percent of students will get a score between 6 and 14? c) Using the 68-95-99.7 rule, approximately what percent of students will get a score between 8 and 16? d) Find the percent of students that will get a score between 8 points and 13 points on this test. e) What percent of students will score at least an 11 points on this test? f) What percent of students will score between 5 points and 12 points on this test? g) How many points would a student have to score in order to be at the 90th percentile on this test? h) What is the z-score associated with a test score of 13 points? i) How many points did a student score if their z-score was -1.5? 2) Which situation below is most likely to be normally distributed? i) The heights of all the trees in a forest. ii) The distances that all the kids at Blaine High School can hit a golf ball. iii) The number of siblings that each student at Anoka High School has. iv) The length of time that 6th grade boys at Roosevelt Middle School can hold their breath. 3) The weights of adult male African elephants are normally distributed with a mean weight of 11,000 pounds and standard deviation of 900 pounds. www.ckl2.org 310 a) Between what two weights do the middle 50% of all adult male African elephants weigh? b) Suppose one of these elephants weighs 13,400 pounds. At what percentile is this weight? c) At what weight would we find the 70th percentile of weights for these elephants? 4) Suppose that IQ test scores are normally distributed with a mean of 100 and a standard deviation of 15. a) What z-score is associated with and IQ score of 125? b) The intelligence organization MENSA requires that members score in the top 2.5% of all IQ test takers to gain membership in the organization. What IQ score must a person score to qualify for MENSA? c) What percentage of IQ scores are greater than 125? d) What percentage of IQ scores are less than 70? Use the 68-95-99.7 rule to approximate your answer. e) Who did better, a person with an IQ score of 143 or someone who was at the 99th percentile on the IQ test? Justify your answer. 5) In a certain city, the number of pounds of newspaper recycled each month by a household produces a normal distribution with a mean of 8.5 pounds and a standard deviation of 2.7 pounds. a) Draw a sketch for this normal distribution and shade in the region that represents the households that recycle between 6 and 12 pounds of newspaper each month. b) What percent of households recycle between 6 and 12 pounds of newspaper each month? c) A local newspaper wants to do a story on newspaper recycling in the city. They decide that they would like to base their story on a typical household. After some thought, they decide that 'typical' means that they are in the middle 60% of all households in terms of newspaper recycling. Between what two weights are the 'typical' households? 6) Snowfall each winter in the Twin Cities is normally distributed with a mean of 56 inches and a standard deviation of 11 inches. a) In what percentage of years does the Twin Cities get less than 3 feet of snow? 311 www.ckl2.org b) In what percentage of years does the Twin Cities get more than 6 feet of snow? c) The winter of 2010-2011 was the fifth snowiest on record for the Twin Cities with a total snowfall of 85 inches. What percentage of years will have snowfalls of more than 85 inches? d) A winter is considered to be dry if it in the lowest 10% of snowfall totals. What is the maximum amount of snow the Twin Cities could receive to still be called a dry winter? 7) You just got your history test back and found out you scored 37 points. The scores were normally distributed with a mean of 31 points and a standard deviation of 4 points. When you tell your parents how you did, your little brother pipes in that he got a 56 on his math test which was normally distributed with a mean of 40 points and a standard deviation of 11 points. How could you use z-scores to explain to your parents that your score was more impressive than your little brother's score? 8) In 1941, Ted Williams batted 0.406 for the baseball season. He is the last player to hit over 0.400 for an entire major league baseball season. In 2009, Joe Mauer hit 0.365 for the baseball season. In 1941, the batting averages were normally distributed with a mean of 0.260 and a standard deviation of 0.041. In 2009, the batting averages were normally distributed with a mean of 0.262 and a standard deviation of 0.035. Decide which player had a better season compared to the rest of the league during their respective year by comparing z-scores. 9) Suppose that medals will be given out to any student at Andover High School that scores at least 200 points on an aptitude test. The mean score on the aptitude test is 150 points with a standard deviation of 22 points. How many medals should be ordered if there are 456 students who sign up for the test? Image References Density Curve www.madscientist.blogspot.com Skewed Distributions http://en.wikipedia.org/wiki/Skewness 68-95-99.7 Normal Curve www.rahulgladwin.com Earthworms http://www.flowers.vg Pet Store Window www.teddyhilton.com Traffic Jam www.rnw.nl www.ckl2.org 312 Leafcutter Ant www.orkin.com/ants Pair of Queens http://www.123rf.com American Diabetes Association http://americandiabetesassn.wordpress.com Track Race http://www.tierraunica.com Eagle http://www.esa.org Elephant http : / /animals .nationalgeograp hie .com Blizzard http://www.csc.cs.colorado.edu Joe Mauer http://www.mauersquickswing.com http://facstaff.unca.edu/dohse/Stat225/Images/Table-z.JPG 313 www.ckl2.org Chapter 8 Appendices 8.1 Appendix A - Tables Appendix A, Part 1 - Random Digit Table Line 101 19223 95034 05756 28713 96409 12531 42544 82853 Line 102 73676 47150 99400 01927 27754 42648 82425 36290 Line 103 45467 71709 77558 00095 32863 29485 82226 90056 Line 104 52711 38889 93074 60227 40011 85848 48767 52573 Line 105 95592 94007 69971 91481 60779 53791 17297 59335 Line 106 68417 35013 15529 72765 85089 57067 50211 47487 Line 107 82739 57890 20807 47511 81676 55300 94383 14893 Line 108 60940 72024 17868 24943 61790 90656 87964 18883 Line 109 36009 19365 15412 39638 85453 46816 83485 41979 Line 110 38448 48789 18338 24697 39364 42006 76688 08708 Line 111 81486 69487 60513 09297 00412 71238 27649 39950 Line 112 59636 88804 04634 71197 19352 73089 84898 45785 Line 113 62568 70206 40325 03699 71080 22553 11486 11776 Line 114 45149 32992 75730 66280 03819 56202 02938 70915 Line 115 61041 77684 94322 24709 73698 14526 31893 32592 Line 116 14459 26056 31424 80371 65103 62253 50490 61181 Line 117 38167 98532 62183 70632 23417 26185 41448 75532 Line 118 73190 32533 04470 29669 84407 90785 65956 86382 Line 119 95857 07118 87664 92099 58806 66979 98624 84826 Line 120 35476 55972 39421 65850 04266 35435 43742 11937 Line 121 71487 09984 29077 14863 61683 47052 62224 51025 Line 122 13873 81598 95052 90908 73592 75186 87136 95761 www.ckl2.org 314 Line 123 54580 81507 27102 56027 55892 33063 41842 81868 Line 124 71035 09001 43367 49497 72719 96758 27611 91596 Line 125 96746 12149 37823 71868 18442 35119 62103 39244 Line 126 96927 19931 36089 74192 77567 88741 48409 41903 Line 127 43909 99477 25330 64359 40085 16925 85117 36071 Line 128 15689 14227 06565 14374 13352 49367 81982 87209 Line 129 36759 58984 68288 22913 18638 54303 00795 08727 Line 130 69051 64817 87174 09517 84534 06489 87201 97245 Line 131 05007 16632 81194 14873 04197 85576 45195 96565 Line 132 68732 55259 84292 08796 43165 93739 31685 97150 Line 133 45740 41807 65561 33302 07051 93623 18132 09547 Line 134 27816 78416 18329 21337 35213 37741 04312 68508 Line 135 66925 55658 39100 78458 11206 19876 87151 31260 Line 136 08421 44753 77377 28744 75592 08563 79140 92454 Line 137 53645 66812 61421 47836 12609 15373 98481 14592 Line 138 66831 68908 40772 21558 47781 33586 79177 06928 Line 139 55588 99404 70708 41098 43563 56934 48394 51719 Line 140 12975 13258 13048 45144 72321 81940 00360 02428 Line 141 96767 35964 23822 96012 94591 65194 50842 53372 Line 142 72829 50232 97892 63408 77919 44575 24870 04178 Line 143 88565 42628 17797 49376 61762 16953 88604 12724 Line 144 62964 88145 83083 69453 46109 59505 69680 00900 Line 145 19687 12633 57857 95806 09931 02150 43163 58636 Line 146 37609 59057 66967 83401 60705 02384 90597 93600 Line 147 54973 86278 88737 74351 47500 84552 19909 67181 Line 148 00694 05977 19664 65441 20903 62371 22725 53340 Line 149 71546 05233 53946 68743 72460 27601 45403 88692 Line 150 07511 88915 41267 16853 84569 79367 32337 03316 315 www.ckl2.org Appendix A, Part 2 - The Normal Distribution Table For z-scores with z less than or equal to zero Table 8.1: z 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.0 -3.0 0.0010 0.0010 0.0011 0.0011 0.0011 0.0012 0.0012 0.0013 0.0013 0.0013 -2.9 0.0014 0.0014 0.0015 0.0015 0.0016 0.0016 0.0017 0.0018 0.0018 0.0019 -2.8 0.0019 0.0020 0.0021 0.0021 0.0022 0.0023 0.0023 0.0024 0.0025 0.0026 -2.7 0.0026 0.0027 0.0028 0.0029 0.0030 0.0031 0.0032 0.0033 0.0034 0.0035 -2.6 0.0036 0.0037 0.0038 0.0039 0.0040 0.0041 0.0043 0.0044 0.0045 0.0047 -2.5 0.0048 0.0049 0.0051 0.0052 0.0054 0.0055 0.0057 0.0059 0.0060 0.0062 -2.4 0.0064 0.0066 0.0068 0.0069 0.0071 0.0073 0.0075 0.0078 0.0080 0.0082 -2.3 0.0084 0.0087 0.0089 0.0091 0.0094 0.0096 0.0099 0.0102 0.0104 0.0107 -2.2 0.0110 0.0113 0.0116 0.0119 0.0122 0.0125 0.0129 0.0132 0.0136 0.0139 -2.1 0.0143 0.0146 0.0150 0.0154 0.0158 0.0162 0.0166 0.0170 0.0174 0.0179 -2.0 0.0183 0.0188 0.0192 0.0197 0.0202 0.0207 0.0212 0.0217 0.0222 0.0228 -1.9 0.0233 0.0239 0.0244 0.0250 0.0256 0.0262 0.0268 0.0274 0.0281 0.0287 -1.8 0.0294 0.0301 0.0307 0.0314 0.0322 0.0329 0.0336 0.0344 0.0351 0.0359 -1.7 0.0367 0.0375 0.0384 0.0392 0.0401 0.0409 0.0418 0.0427 0.0436 0.0446 -1.6 0.0455 0.0465 0.0475 0.0485 0.0495 0.0505 0.0516 0.0526 0.0537 0.0548 -1.5 0.0559 0.0571 0.0582 0.0594 0.0606 0.0618 0.0630 0.0643 0.0655 0.0668 -1.4 0.0681 0.0694 0.0708 0.0721 0.0735 0.0749 0.0764 0.0778 0.0793 0.0808 -1.3 0.0823 0.0838 0.0853 0.0869 0.0885 0.0901 0.0918 0.0934 0.0951 0.0968 -1.2 0.0985 0.1003 0.1020 0.1038 0.1056 0.1075 0.1093 0.1112 0.1131 0.1151 -1.1 0.1170 0.1190 0.1210 0.1230 0.1251 0.1271 0.1292 0.1314 0.1335 0.1357 www.ckl2.org 316 Table 8.1: (continued) -1.0 0.1379 0.1401 0.1423 0.1446 0.1469 0.1492 0.1515 0.1539 0.1562 0.1587 -0.9 0.1611 0.1635 0.1660 0.1685 0.1711 0.1736 0.1762 0.1788 0.1814 0.1841 -0.8 0.1867 0.1894 0.1922 0.1949 0.1977 0.2005 0.2033 0.2061 0.2090 0.2119 -0.7 0.2148 0.2177 0.2206 0.2236 0.2266 0.2296 0.2327 0.2358 0.2389 0.2420 -0.6 0.2451 0.2483 0.2514 0.2546 0.2578 0.2611 0.2643 0.2676 0.2709 0.2743 -0.5 0.2776 0.2810 0.2843 0.2877 0.2912 0.2946 0.2981 0.3015 0.3050 0.3085 -0.4 0.3121 0.3156 0.3192 0.3228 0.3264 0.3300 0.3336 0.3372 0.3409 0.3446 -0.3 0.3483 0.3520 0.3557 0.3594 0.3632 0.3669 0.3707 0.3745 0.3783 0.3821 -0.2 0.3829 0.3897 0.3936 0.3974 0.4013 0.4052 0.4090 0.4129 0.4168 0.4207 -0.1 0.4247 0.4286 0.4325 0.4364 0.4404 0.4443 0.4483 0.4522 0.4562 0.4602 -0.0 0.4641 0.4681 0.4721 0.4761 0.4801 0.4840 0.4880 0.4920 0.4960 0.5000 For z-scores with z greater than or equal to Table 8.2: z 0.0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359 0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753 0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141 0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0..6443 0.6480 0.6517 0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879 0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224 0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549 0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852 0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133 317 www.ckl2.org Table 8.2: (continued) 0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389 1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621 1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830 1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015 1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177 1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319 1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441 1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545 1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633 1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706 1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767 2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817 2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857 2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890 2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916 2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936 2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952 2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964 2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974 2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981 2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986 3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990 www.ckl2.org 318 Appendix A, Part 3 - A standard deck of 52 cards Clubs Spades Hearts Diamonds A* A+ A» A+ 2* 2+ 2» 2+ 3+ 3+ 3* 3+ 4+ 4+ 4* *♦ 5+ 5+ 5» 5+ 6+ 6+ 6* 6+ 7+ 7+ 7* 7+ 8+ 8+ SV B+ 9+ 9+ 9¥ 9+ 10* 10+ 10» Jack* Jack+ Jack+ Jack* Queen+ Queen + Q.ueen+ QueenV King+ King+ King+ King* King-*- Figure 8.1 Appendix A, Part 4 - Results for the total of two 6-sided dice + 1 2 3 4 5 6 1 2 3 4 5 6 7 2 3 4 5 6 7 8 3 4 5 6 7 8 9 4 5 6 7 8 9 10 5 6 7 8 9 10 11 6 7 8 9 10 11 12 Figure 8.2 319 www.ckl2.org 8.2 Appendix B - Glossary and Index 95% confidence statement - Page , Section "We are 95% confident that the true proportion of (parameter of interest) will be between (low value of conf. int.) and (high value of conf. int.) ." Back to Back Stem Plots - Page , Section A stem plot in which two sets of numerical data share the stems in the middle, with one set has its leaves going to the right and the other set has its leaves going to the left. Bar Graph - Page , Section A graph in which each bar shows how frequently a given category occurs. The bars can go either horizontally or vertically. Bars should be of consistent width and need to be equally spaced apart. The categories may be placed in any order along the axis. Bias - Page , Section A measurement that is repeatedly either too high or too low. Bin Width See Class Size Bi-variate Data - Page , Section Numerical data that measures two variables. Blinded Study - Page , Section A study in which the subject does not know exactly what treatment they are getting. Block Design - Page , Section A study in which subjects are divided into distinct categories with certain characteristics (for example, males and females) before being randomly assigned treatments. Box Plot (Box and Whisker Plot) - Page , Section A display in which a numerical data set is divided into quarters. The 'box' marks the middle 50% of the data and the 'whiskers' mark the upper 25% and lower 25% of the data. www.ckl2.org 320 Categorical Variable - Page , Section Variables that can be put into categories, like favorite color, type of car you own, your sports jersey number, etc... Census - Page , Section A special type of study in which data is gathered from every single member of the population. Center - Page , Section Typically, it is the mean, median, or the mode of a data set. In a normal distribution curve the mean, median, and mode all mark the center. Chance Behavior - Page , Section Events whose outcomes are not predictable in the short term, but have long term predictability. Class Size (Bin Width) - Page , Section A consistent width that all bars on a histogram have. A quick estimation of a reasonable class size is to roughly divide the range by a value from about 7 to 10. Coincidence - Page , Section A relationship between two variables that simply occurs by chance. Combination - Page , Section An arrangement of a set of object in which the order does not matter. n C r = [ , n '_ -,, Common Response - Page , Section A situation in which two variables have similar behaviors but are actually both responding to an additional lurking variable. Complement of an Event - Page , Section The probability of an event, 'A', NOT occurring. It can be thought of the opposite of an event and can be notated as A c or ~A. Compound Event - Page , Section An event with two or more steps such as drawing a card and then rolling a die. 321 www.ckl2.org Conditional Probability - Page , Section The probability of a particular outcome happening assuming a certain prerequisite condition has already been met. A clue that a conditional probability is being considered is the word 'given' or the vertical bar symbol, |. Confidence Interval - Page , Section The range of answers included within the margin of error. Typically, we use a 95% confidence interval meaning it is very likely (95% chance) that the parameter lies within this range. Confounding - Page , Section Occurs when two variables are related, but it is not a clear cause/effect relationship because there are other variables that are carrying influence in the situation. Context - Page , Section The specific realities of the situation we are considering. We often consider the labels and units when defining the context. Contingency Table See 2- Way Table Control Group - Page , Section A group in an experiment that does not receive the actual treatment, but rather receives a placebo, a known treatment, or no treatment at all. Convenience Sample - Page , Section A biased sampling method in which data is only gathered from those individuals who are easy to ask or are conveniently located. Correlation (r) - Page , Section A statistic that is used to measure the strength and direction of a linear correlation whose values range from -1 to 1. The sign of the correlation (+/-) matches the sign of the slope of the regression equation. Data - Page , Section A collection of facts, measurements, or observations about a set of individuals. www.ckl2.org 322 Density Curve - Page , Section A curve that gives a rough description of a distribution. The curve is smooth and always has an area equal to 1 whole or 100%. Dependent Events - Page , Section A situation in which one event changes the probability of another event. Direct Cause and Effect - Page , Section A situation in which one variable causes a specific effect to occur with no lurking variables. Direction - Page , Section One of three general results reported for a linear regression. It will be reported as either be positive, negative, or 0. Disjoint See Mutually Exclusive Events Dot Plot - Page , Section A simple display that places a dot above the axis for each value. There is a dot for each value, so values that occur more than once will be shown by stacked dots. Double Blind - Page , Section A study in which neither the experimenter nor the subject knows which treatment is being given. Empirical Rule (68-95-99.7 Rule) - Page , Section A rule that states that in a normal distribution, 68% of the data is located within one standard deviation from the mean, 95% of the data is located within two standard deviations from the mean, and 99.7% of the data is located within three standard deviations from the mean. Event - Page , Section Any action from which a result will be recorded or measured. Expected Value - Page , Section The average result over the long run for an event if repeated a large number of times. 323 www.ckl2.org Experiment - Page , Section A study in which the researchers impose a treatment on the subjects. Explanatory Variable - Page , Section The x-axis variable. It can often be viewed as the 'cause' variable or the independent variable. Factorial - Page , Section A number followed by an exclamation point indicated repeated multiplication down to 1. For example, 4! = 4x3x2x1. Fair Game - Page , Section A game in which neither the player nor the house has an advantage. An average player over the long run will neither gain nor lose money. In other words, the expected value of the game is the same as the cost to play the game. Five-Number Summary - Page , Section A description of data that includes the minimum, first quartile, median, third quartile, and maximum numbers which can be used to create a box plot. Form - Page , Section A general description of the pattern in a scatterplot. Typical descriptions include linear, curved, or random (no specific form). Frequency Table - Page , Section A table that shows the number of occurrences in each category. Fundamental Counting Principle - Page , Section A rule that states to find the number of outcomes for a given situation, simply multiply the number of outcomes for each individual event. Histogram - Page , Section A special bar graph for a numerical data set. In a histogram, each bar has the same width with no space between them where bars track the frequency of results in its given range. www.ckl2.org 324 Independent Events - Page , Section Two events in which the outcome of one event does not change the probabilities for the outcome for the other event. Individual - Page , Section The subject being studied. This can be a person, an animal or an object. Inter-Quartile Range (IQR) - Page , Section The distance between the lower and upper quartiles. IQR = Q3 - Qi Instrument of Measurement - Page , Section Tool used to make measurements. Typical instruments are tools like rulers, scales, thermome- ters, or speedometers. Intersection of Sets - Page , Section In a Venn Diagram, it includes the results that are members of more than one group simul- taneously. We use the symbol, n, to indicate the intersection and think of the intersection of those parts of the diagram that include both A and B. Law of Large Numbers - Page , Section A rule that states that we will eventually get closer to the theoretical probability as we greatly increase the number of times an event is repeated. Line Graph See Time Plot Lurking Variable - Page , Section An additional variable that was not taken into account in a particular situation. Margin of Error - Page , Section A range of results, often spanning from 2 standard deviations below to 2 standard deviations above the mean in which we are 95% confident that the true parameter is located. The quick method for an approximation of the margin of error for a 95% confidence interval is M.O.E = -j=. Mean (Average) - Page , Section The sum of all the numbers divided by the number of values in the data set. It is also located at the center of a normal distribution and is a good measure of center for symmetric data sets. 325 www.ckl2.org Median - Page , Section The data result in the middle of a data list that has been organized smallest to largest. If there are two middle data values, then the median is located halfway between those two values. In a visual distribution, it marks the 50/50 area point on the graph. Use for skewed data sets. Mode - Page , Section The result that appears most frequently in a data set. It also occurs at the highest point of a density curve. Multistage Random Sample - Page , Section A sampling technique that uses randomly selected sub-groups of a population before random selection of individuals occurs. Mutually Exclusive Events (Disjoint) - Page , Section Events that cannot occur at the same time. Negative Linear Association - Page , Section A situation such that as one numerical variable increases, another numerical variable decreases. Non-Response - Page , Section A non-sampling error in which subjects do not participate or do not answer questions in a survey. Normal Distribution Curve - Page , Section A bell-shaped curve that describes a symmetrical data set such that the most frequent results occur near the mean and results become less frequent as you move further from the mean. Numerical Variable - Page , Section A variable that can be assigned a numerical value, such as a height, a distance, a temperatures, etc... Observational Study - Page , Section A study in which researchers do not impose a treatment on the subjects. Data is collected by watching the subjects or from information already available. (Observe but do not disturb) www.ckl2.org 326 Outcome - Page , Section A possible result of an event. Outlier - Page , Section A value that is unusual when compared to the rest of a data set. High outliers will be greater than Q3 + I.5IQR. Low outliers will be below Qi-1.5IQR. Parallel box plots - Page , Section Multiple box plots graphed on the same axes to compare multiple data sets. Parameter - Page , Section A value that describes a truth about a population. Sometimes, the value is unknown so a parameter is often given as a description of truth. Permutation - Page , Section A specific order or arrangement of a set of objects or items. In a permutation, the order in which the items are selected matters. Pictograph - Page , Section A bar graph that uses pictures instead of bars. These graphs can be misleading because pictures measure height and width, where bar graphs measure only height. To be effective, all the pictures used must be the same size. Pie chart - Page , Section A graph which shows each category as a part of the whole in a circle graph. Pie charts can be used if exactly 100% of the results for a particular situation are known. Placebo - Page , Section A fake treatment that is similar in appearance to the real treatment. Placebo Effect - Page , Section The placebo effect occurs when a subject starts to experience changes simply because they believe they are receiving a treatment. Population - Page , Section The entire group of individuals we are interested in. 327 www.ckl2.org Positive Linear Association - Page , Section A situation such that as one numerical variable increases, the other numerical variable also increases. Prime Number - Page , Section A number that is divisible only by 1 and itself. Remember, 1 is not a prime number! Probability - Page , Section The likelihood of a particular outcome occurring. Probability Model - Page , Section A table that lists all outcomes of an event and their respective probabilities. The sum of all the probabilities in a probability model must equal 1. Processing Errors - Page , Section An error commonly made due to issues like poor calculations or inaccurate recording of results. Prospective Studies - Page , Section A study which follows up with study subjects in the future in an effort to see if there were any long-term effects. Quartile 1 - Page , Section The median of all the values to the left of the median. Do not include the median itself. Quartile 3 - Page , Section The median of all the values to the right of the median. Do not include the median itself. Random Event - Page , Section An event for which we can not be certain of the outcome. Table of Random Digits - Page , Section A long list of randomly chosen digits from to 9, usually generated by computer software or calculators. A table of random digits can be found in Appendix A, Part 1. www.ckl2.org 328 Random Sampling Error - Page , Section Even though a sample is randomly selected, it is entirely possible that a particular result within the population will be over-represented. Larger sample sizes reduce random sampling error. The margin of error is stated with most studies to account for random sampling error. Range - Page , Section A basic description of how spread out a data set is. It is calculated by subtracting the smallest number in a data set from the largest number in the data set. Reliability - Page , Section How consistently a particular measurement technique gives the same, or nearly the same mea- surement. Response Bias - Page , Section Occurs when an individual responds to a survey with an incorrect or untruthful answer. This type of bias can frequently happen when questions are potentially sensitive or embarrassing. Response Variable - Page , Section The y-axis variable. It can often be thought of as the 'effect' variable or dependent variable. Retrospective Study - Page , Section A study in which information about a subject's past is used in the study. Sample - Page , Section A representative subset of a population. Sample Space - Page , Section A list of all the possible outcomes that may occur. Sample Survey - Page , Section A survey that uses a subset of the population in order to try to make predictions about the entire population. Sampling Frame - Page , Section A list of all members of a population. 329 www.ckl2.org Scatterplot - Page , Section Graphs that represent a relationship between two numerical variables where each data point is shown as a coordinate point on a scaled grid. SCOFD This is used for the description of a scatterplot and stands for Strength, Context, Outliers, Form, and Direction. Simple Random Sample (SRS) - Page , Section A sample where all possible groups of a particular size are equally possible. It can be thought of as putting all members of the population in a hat and randomly drawing until the desired sample size is reached. Simulation - Page , Section A model of a real situation that can be used to make predictions about what might really happen. Often, tables of random digits are used to carry out simulations. Skewed Distribution - Page , Section A distribution in which the majority of the data is concentrated on one end of the distribution. Visually, there is a 'tail' on the side with less data and this is the direction of the skew. SOCCS - Page , Section A way to remember the key information to discuss for a distribution: Shape, Outliers, Center, Context, and Spread. Spread - Page , Section A way to measure variability of a data set. Common measures of spread are the range, standard deviation, and IQR. Standard deviation - Page , Section A measure of spread relative to the mean of a data set. Use this measurement for any data set which is approximately normally distributed. Statistic - Page , Section A number that describes results from sample. This number is often used to make an approxi- mation of the parameter. www.ckl2.org 330 Stem plot - Page , Section A method of organizing data that sorts the data in a visual fashion. The stem is made up of all the leading digits of a piece of data and the leaf is the final digit. Stratified Random Sample - Page , Section A sample in which the population is divided into distinct groups called strata before a random sample is chosen from each strata. Strength - Page , Section One of three measurements reported for a best-fit line that describes how closely the data matches a perfect line. Subjects - Page , Section The individuals that are being studied in an experiment. Symmetrical Distribution - Page , Section A distribution in which the left side of the distribution looks like the mirror image of the right side of the distribution. Systematic random sample - Page , Section A sampling method in which the first selection is made randomly and then a 'system' is used to make the remaining selections. Theoretical Model - Page , Section A model that gives a picture of exactly the frequencies of what should happen in a situation involving probability. Theoretical probability - Page , Section A mathematical calculation of the likelihood that an event will occur. Time Plot (Line Graph) - Page , Section A graph that shows how a variable changes over time. Tree Diagram - Page , Section A visual representation of a series of events where each successive event branches off from the previous event. 331 www.ckl2.org Two- Way Table (Contingency Table) - Page , Section A table which tracks two characteristics from a set of individuals. For example, we might track gender and grade of all the students in your high school. Undercoverage - Page , Section A sampling error in which an entire group or groups of subjects are left out or underrepresented in a study. Union of Sets - Page , Section A union includes all results that are in either one category, another category, or both categories in a Venn diagram. We use the symbol U and can think of a union as either A or B (or both). Validity - Page , Section A measurement technique is valid if it is an appropriate way to collect data. Variables - Page , Section Characteristics about the individuals that the researchers might be interested in. Venn Diagrams - Page , Section Diagrams that represent outcomes using intersecting circles. Voluntary Response Sample - Page , Section A biased sampling method in which participants get to choose whether or not to participate in the survey. The bias occurs because those who are most passionate about an issue will be more likely to respond. Wording of a Question - Page , Section The wording of a question can be used to manipulate subjects to make them more likely to respond a certain way in a survey causing bias. Z-Score - Page , Section A measure of the number of standard deviations a particular data point is away from the mean in a normal distribution. If a z-score is positive, the value is larger than the mean and if it is negative, it is less than the mean. www.ckl2.org 332 8.3 Appendix C - Calculator Help This appendix is not meant to be a full guide for calculators common to students who take this course. Rather, it is intended to highlight some of the key procedures used on a variety of different calculators. The steps are arranged by topic as opposed to being arranged by calculator. One online source that is helpful for those of you with graphing calculator issues can be found on the Prentice Hall website at http://www.prenhall.com/divisions/esm/app/calc_v2/ . Topic 1 - Combinations and Permutations 1 . Type the n 2. Press the prb button 3. Arrow to nPr or nCr 4. Hit ENTER 5. Type the r 6. Hit ENTER 1 . Type the n 2. Press the xoy button 3. Type the r 4. Press 2nd for nPr or 3rd for nCr 5. Press the — * button Figure 8.3 1 . Type the n 2. Press 2nd 3. Press 8 for nCr or9fornPr 4. Type the r 5. Hit = 1 . Type the n 2. Press the MATH button 3. Arrow right to PRB 4. Arrow down to nPr or nCr 5. Hit ENTER 6. Type the r 7. Hit ENTER Figure 8.4 1 . Type the n 2. Press SHIFT 3. Press x for nPr or + for nCr 4. Type the r 5. Hit = 1 . Type the n 2. Press the nCr button or SHIFT then the nCR button for nPr 3. Type the r 4. Hit = Figure 8.5 333 www.ckl2.org Topic 2 - Random Number Generators 1 . Press the prb button 2. Arrow to rand 3. Hit ENTER 5. Type the lowest #,largest #) 6. Hit ENTER 1. Press the MATH button 2. Arrow right to PRB 3. Arrow down to randlnt 4. Hit ENTER 5. Type the lowest #,largest #, the # of digits you want) 6. Hit ENTER Figure 8.6 Topic 3 - Means and Standard Deviations 1 . Press 2nd and the DATA button for STAT 2. Make sure 1-VAR is underlined then Hit = 3. Press the DATA button again 4. Type your 1st piece of data into x'= then hit the down arrow. Leave FRQ=1. 5. Keep arrowing down until all the data is in then Hit = 6. Press the STATVAR button and arrow right to find mean (x) and standard deviation (s ) Figure 8.7 1 . Press 3rd then x*t> y button for STAT 1 2. Type your 1st piece of data then hit the 1+ button 3. Keep doing this until your last piece of data is in and you've hit 1+ 4. Press 2nd and the x 2 button for mean (x) or 2nd and the Vx button for standard deviation (axn-1) Figure 8.8 1 . Press 2nd then 7 for CSR 2. Type your 1 st piece of data then hit the 1+ button 3. Keep doing this until your last piece of data is in and you've hit 1+ 4. Press 2nd and the x 2 button for mean (x) or 2nd and the Vx button for standard deviation (oxn-1) Figure 8.9 www.ckl2.org 334 Press the DATA button Type your data into Li Press 2nd then the DATA button again for STAT Highlight 1 :1-Var Stats then hit ENTER Highlight Li then hit ENTER Highlight Frq 1 then hit ENTER Highlight CALC then hit ENTER 8. Arrow down to find mean (x) and standard deviation (s x ) Figure 8.10 1 . Press the STAT button 2. Choose 1 : Edit 3. Type your data into Li 4. Press the STAT button again 5. Arrow to the right for CALC 6. Choose 1: 1-Var Stats 7. Press ENTER again 8. Find mean (x) and standard deviation (s x ) Figure 8.11 1 . Press the MODE button 2. Choose STAT 3. Press 1 for 1-VAR 4. Type your data in the x list 5. When all the data is in press the AC button 6. Press SHIFT 7. Press 1 for STAT 8. Press SforVar 9. Press 2 for mean (x) or 4 for standard deviation (xan-1) 10. Hit = Figure 8.12 335 www.ckl2.org Topic 4 - Correlation, Slopes, and Intercepts 1. Press 2nd then the DATA button for STAT 2. Make sure 2-VAR is underlined then Hit = 3. Press the DATA button 4. Type your 1st x-value into x 1 = then arrow down and type your 1st y-value into y 5 = 5. Keep arrowing down until all your data is entered in then Hit = 6. Press the STATVAR button and arrow right to find slope (a) or y-intercept (b) or correlation (r) *NOTE: a & b mean the opposite here Figure 8.13 . Press the DATA button . Type your x data into Li and your y data into Li . Press 2nd then the DATA button again for STAT . Highlight 2: 2-Var Stats then hit ENTER . Highlight Li for x then hit ENTER . Highlight L> for y then hit ENTER . Highlight CALC then hit ENTER . Arrow down to find slope (a) or y-intercept (b) or correlation (r) *NOTE: a & b mean the opposite here Figure 8.14 BOO OOO ISO,) 1 . Press the STAT button 2. Choose 1: Edit 3. Type your x-values into Li and your y-values into I_2 4. Press the STAT button again 5. Arrow to the right for CALC 6. Choose 8: LinReg{a+bx) 7. Press ENTER again 8. Find y-intercept (a) or slope (b) or correlation (r) *NOTE: If r does not appear, then you need to follow these steps once to turn it on. 1 . Press 2nd then the for CATALOG 2. Arrow down until DiagnosticOn is highlighted 3. Press ENTER until the screen shows Done Figure 8.15 www.ckl2.org 336 ODD ano 1 . Press 3rd then the 1+ button for STAT 2 2. Type your 1st x-value then hit the x-«*y button now type your 1st y-value then hit the E+ button 3. A 1 will show up telling you you've entered your first pair of data. Continue this process until all your data is entered. 4. Press 3rd then 4 for correlation (COR) or 2nd then 4 for y-intercept (ITC) or 2nd then 5 for slope (SLP) Figure 8.16 1. Press the MODE button 2. Choose STAT 3. Press 2 again for A+ BX 4. Type your x and y values into the lists 5. When all the data is in press the AC button 6. Press SHIFT 7. Press 1 for STAT 8. Press 7 for Reg 9. Press 1 for y-intercept (A) or 2 for slope (B) or 3 for correlation (r) 10. Hit = Figure 8.17 337 www.ckl2.org Topic 5 - Normal Distributions Your graphing calculator has already been programmed to calculate probabilities for a normal density curve using what is called a cumulative density function orcdf. This is found in £hjg distributions menu above theVARS key. Press [2nd] [VARS], [2] to select the normalcdf) To get normalcdf to work properly, you will need to enter the two values that you are looking at followed by the mean and standard deviation. Your screen should look something like the following: oiai;. ORAM I :norr*alpdf C gBnori*alcdf<: J: inuNorn< 4: invT< 5:tPdf< 6-:t&df< ?4-X*pdf< normalcdf (a,b,n,o) Figure 8.18 Your TI - 83/84 graphing calculators have already been programmed to find values at certain percentiles in a normal curve. This feature is called invNorm and can be found in the Distribution Menu. Press [2nd] [VARS], [3] to select the invNorm To get invNorm to work properly, you will need to enter the percentile followed by the mean and standard deviation. Your screen should look something like the following (you must enter the percent as a decimal]: iltai:- dram iJnornalpdf < 2: normalcdf < fflinvNorn< 4ttpdf< 5:tcdf< 6:X2pdf( 74-X ?cdf< invNorm (%,fi,a) Figure 8.19 Image References Random Digit Table http://uwsp.edu/niath Normal Distribution Table http://www.regentsprep.org www.ckl2.org 338